Added a first stab at an administration guide, added a few items to the deployment guide, cleaned up a couple of docstring warnings, and removed some unused config options in the proxy
commit d7ae75aa58
@@ -4,3 +4,4 @@ doc/build/*
dist
ChangeLog
.coverage
swift.egg-info

doc/source/admin_guide.rst (new file, 154 lines)
@@ -0,0 +1,154 @@
=====================
Administrator's Guide
=====================

------------------
Managing the Rings
------------------

Removing a device from the ring::

    swift-ring-builder <builder-file> remove <ip_address>/<device_name>

Removing a server from the ring::

    swift-ring-builder <builder-file> remove <ip_address>

Adding devices to the ring:

See :ref:`ring-preparing`

See what devices for a server are in the ring::

    swift-ring-builder <builder-file> search <ip_address>

Once you are done with all changes to the ring, the changes need to be
"committed"::

    swift-ring-builder <builder-file> rebalance

Once the new rings are built, they should be pushed out to all the servers
in the cluster.
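For example, pushing the rebalanced rings from the build server might look
like this (the hostnames are illustrative and any copy mechanism will do;
the `/etc/swift` destination matches the default `swift_dir`)::

    for server in proxy1 object1 object2; do
        scp account.ring.gz container.ring.gz object.ring.gz $server:/etc/swift/
    done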
-----------------------
Handling System Updates
-----------------------

It is recommended that system updates and reboots are done a zone at a time.
This allows the update to happen, and for the Swift cluster to stay available
and responsive to requests. It is also advisable, when updating a zone, to let
it run for a while before updating the other zones to make sure the update
doesn't have any adverse effects.
----------------------
Handling Drive Failure
----------------------

In the event that a drive has failed, the first step is to make sure the drive
is unmounted. This will make it easier for Swift to work around the failure
until it has been resolved. If the drive is going to be replaced immediately,
then it is best to just replace the drive, format it, remount it, and let
replication fill it up.

If the drive can't be replaced immediately, then it is best to leave it
unmounted and remove the drive from the ring. This will allow all the
replicas that were on that drive to be replicated elsewhere until the drive
is replaced. Once the drive is replaced, it can be re-added to the ring.
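As a sketch, handling a drive that can't be swapped right away might look
like this (the device name, IP address, and builder file are hypothetical;
`/srv/node` matches the default `device_dir`)::

    umount /srv/node/sdb1
    swift-ring-builder object.builder remove 192.168.1.10/sdb1
    swift-ring-builder object.builder rebalance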
-----------------------
Handling Server Failure
-----------------------

If a server is having hardware issues, it is a good idea to make sure the
Swift services are not running. This will allow Swift to work around the
failure while you troubleshoot.

If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Swift work
around the failure and get the machine fixed and back online. When the
machine comes back online, replication will make sure that anything that is
missing during the downtime will get updated.

If the server has more serious issues, then it is probably best to remove
all of the server's devices from the ring. Once the server has been repaired
and is back online, the server's devices can be added back into the ring.
It is important that the devices are reformatted before putting them back
into the ring, as they are likely to be responsible for a different set of
partitions than before.
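For instance, pulling a failed server's devices out of all three rings could
look like this (the IP address and builder file names are hypothetical)::

    for builder in account.builder container.builder object.builder; do
        swift-ring-builder $builder remove 192.168.1.10
        swift-ring-builder $builder rebalance
    done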
-----------------------
Detecting Failed Drives
-----------------------

It has been our experience that when a drive is about to fail, error messages
will spew into `/var/log/kern.log`. There is a script called
`swift-drive-audit` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Swift can
work around it. The script takes a configuration file with the following
settings:

[drive-audit]

================== ========== ===========================================
Option             Default    Description
------------------ ---------- -------------------------------------------
log_facility       LOG_LOCAL0 Syslog log facility
log_level          INFO       Log level
device_dir         /srv/node  Directory devices are mounted under
minutes            60         Number of minutes to look back in
                              `/var/log/kern.log`
error_limit        1          Number of errors to find before a device
                              is unmounted
================== ========== ===========================================
This script has only been tested on Ubuntu 10.04, so if you are using a
different distro or OS, some care should be taken before using it in
production.
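A minimal configuration file built from the defaults above, plus a crontab
entry to run the script hourly, might look like this (the file path and the
exact command line are assumptions, not a documented invocation)::

    # /etc/swift/drive-audit.conf (hypothetical path)
    [drive-audit]
    log_facility = LOG_LOCAL0
    log_level = INFO
    device_dir = /srv/node
    minutes = 60
    error_limit = 1

    # /etc/crontab entry (hypothetical)
    0 * * * * root /usr/bin/swift-drive-audit /etc/swift/drive-audit.conf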
--------------
Cluster Health
--------------

TODO: Greg, add docs here about how to use swift-stats-populate, and
swift-stats-report
------------------------
Debugging Tips and Tools
------------------------

When a request is made to Swift, it is given a unique transaction id. This
id should be in every log line that has to do with that request. This can
be useful when looking at all the services that are hit by a single request.

If you need to know where a specific account, container or object is in the
cluster, `swift-get-nodes` will show the location where each replica should be.
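For example, locating the replicas of an object might look like this (the
ring path and the account/container/object names are hypothetical)::

    swift-get-nodes /etc/swift/object.ring.gz AUTH_test pictures cat.jpg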
If you are looking at an object on the server and need more info,
`swift-object-info` will display the account, container, replica locations
and metadata of the object.

If you want to audit the data for an account, `swift-account-audit` can be
used to crawl the account, checking that all containers and objects can be
found.
-----------------
Managing Services
-----------------

Swift services are generally managed with `swift-init`. The general usage is
``swift-init <service> <command>``, where service is the swift service to
manage (for example object, container, account, proxy) and command is one of:

========== ===============================================
Command    Description
---------- -----------------------------------------------
start      Start the service
stop       Stop the service
restart    Restart the service
shutdown   Attempt to gracefully shut down the service
reload     Attempt to gracefully restart the service
========== ===============================================

A graceful shutdown or reload will finish any current requests before
completely stopping the old service. There is also a special case of
`swift-init all <command>`, which will run the command for all swift services.
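For example, gracefully restarting just the proxy, then restarting every
Swift service on the machine::

    swift-init proxy reload
    swift-init all restart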
@@ -51,6 +51,8 @@ Load balancing and network design is left as an exercise to the reader,
but this is a very important part of the cluster, so time should be spent
designing the network for a Swift cluster.

.. _ring-preparing:

------------------
Preparing the Ring
------------------
@@ -320,7 +322,7 @@ per_diff           1000
concurrency        8          Number of replication workers to spawn
run_pause          30         Time in seconds to wait between replication
                              passes
node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
reclaim_age        604800     Time elapsed in seconds before an account
                              can be reclaimed
@@ -353,6 +355,99 @@ node_timeout       10         Request timeout to external services
conn_timeout       0.5        Connection timeout to external services
================== ========== ===========================================
--------------------------
Proxy Server Configuration
--------------------------

[proxy-server]

============================ =============== =============================
Option                       Default         Description
---------------------------- --------------- -----------------------------
log_facility                 LOG_LOCAL0      Syslog log facility
log_level                    INFO            Log level
bind_ip                      0.0.0.0         IP Address for server to
                                             bind to
bind_port                    80              Port for server to bind to
cert_file                                    Path to the ssl .crt
key_file                                     Path to the ssl .key
swift_dir                    /etc/swift      Swift configuration directory
log_headers                  True            If True, log headers in each
                                             request
workers                      1               Number of workers to fork
user                         swift           User to run as
recheck_account_existence    60              Cache timeout in seconds to
                                             send memcached for account
                                             existence
recheck_container_existence  60              Cache timeout in seconds to
                                             send memcached for container
                                             existence
object_chunk_size            65536           Chunk size to read from
                                             object servers
client_chunk_size            65536           Chunk size to read from
                                             clients
memcache_servers             127.0.0.1:11211 Comma separated list of
                                             memcached servers ip:port
node_timeout                 10              Request timeout to external
                                             services
client_timeout               60              Timeout to read one chunk
                                             from a client
conn_timeout                 0.5             Connection timeout to
                                             external services
error_suppression_interval   60              Time in seconds that must
                                             elapse since the last error
                                             for a node to be considered
                                             no longer error limited
error_suppression_limit      10              Error count to consider a
                                             node error limited
rate_limit                   20000.0         Max container level ops per
                                             second
account_rate_limit           200.0           Max account level ops per
                                             second
rate_limit_account_whitelist                 Comma separated list of
                                             account name hashes to not
                                             rate limit
rate_limit_account_blacklist                 Comma separated list of
                                             account name hashes to block
                                             completely
============================ =============== =============================
[auth-server]

============ =================================== ========================
Option       Default                             Description
------------ ----------------------------------- ------------------------
class        swift.common.auth.DevAuthMiddleware Auth wsgi middleware
                                                 to use
ip           127.0.0.1                           IP address of auth
                                                 server
port         11000                               Port of auth server
node_timeout 10                                  Request timeout
============ =================================== ========================
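Putting a few of these options together, a bare-bones `proxy-server.conf`
might look like this sketch (the values and the auth server address are
illustrative, not recommendations)::

    [proxy-server]
    bind_port = 80
    user = swift
    workers = 8
    log_facility = LOG_LOCAL0

    [auth-server]
    class = swift.common.auth.DevAuthMiddleware
    ip = 10.1.2.5
    port = 11000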
------------------------
Memcached Considerations
------------------------

Several of the Services rely on Memcached for caching certain types of
lookups, such as auth tokens, and container/account existence. Swift does
not do any caching of actual object data. Memcached should be able to run
on any servers that have available RAM and CPU. At Rackspace, we run
Memcached on the proxy servers. The `memcache_servers` config option
in the `proxy-server.conf` should contain all memcached servers.
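For example, if memcached runs on each of three proxy servers, the option
might read (addresses are hypothetical, in the documented ip:port format)::

    memcache_servers = 10.0.0.1:11211,10.0.0.2:11211,10.0.0.3:11211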
-----------
System Time
-----------

Time may be relative but it is relatively important for Swift! Swift uses
timestamps to determine which is the most recent version of an object.
It is very important for the system time on each server in the cluster to
be synced as closely as possible (more so for the proxy server, but in general
it is a good idea for all the servers). At Rackspace, we use NTP with a local
NTP server to ensure that the system times are as close as possible. This
should also be monitored to ensure that the times do not vary too much.
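As one way to spot drift, the offsets reported by the local NTP daemon can be
checked on each server (assuming the standard ntp tools are installed; output
format varies by version)::

    ntpq -p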
----------------------
General Service Tuning
----------------------
@@ -39,6 +39,7 @@ Deployment:
   :maxdepth: 1

   deployment_guide
   admin_guide

Source:
@@ -12,8 +12,6 @@
# recheck_account_existence = 60
# recheck_container_existence = 60
# object_chunk_size = 8192
# container_chunk_size = 8192
# account_chunk_size = 8192
# client_chunk_size = 8192
# Default for memcache_servers is below, but you can specify multiple servers
# with the format: 10.1.2.3:11211,10.1.2.4:11211
@@ -32,7 +30,6 @@
# account_rate_limit = 200.0
# rate_limit_account_whitelist = acct1,acct2,etc
# rate_limit_account_blacklist = acct3,acct4,etc
# container_put_lock_timeout = 5

# [auth-server]
# class = swift.common.auth.DevAuthMiddleware
@@ -201,10 +201,14 @@ class AccountReaper(object):
        :param partition: The partition in the account ring the account is on.
        :param nodes: The primary node dicts for the account to delete.

        * See also: :class:`swift.common.db.AccountBroker` for the broker
          class.
        * See also: :func:`swift.common.ring.Ring.get_nodes` for a description
          of the node dicts.
        .. seealso::

            :class:`swift.common.db.AccountBroker` for the broker class.

        .. seealso::

            :func:`swift.common.ring.Ring.get_nodes` for a description
            of the node dicts.
        """
        begin = time()
        account = broker.get_info()['account']
@@ -123,7 +123,7 @@ def invalidate_hash(suffix_dir):
    Invalidates the hash for a suffix_dir in the partition's hashes file.

    :param suffix_dir: absolute path to suffix dir whose hash needs
                       invalidating
    """
    suffix = os.path.basename(suffix_dir)
@@ -949,9 +949,6 @@ class BaseApplication(object):
        self.conn_timeout = float(conf.get('conn_timeout', 0.5))
        self.client_timeout = int(conf.get('client_timeout', 60))
        self.object_chunk_size = int(conf.get('object_chunk_size', 65536))
        self.container_chunk_size = \
            int(conf.get('container_chunk_size', 65536))
        self.account_chunk_size = int(conf.get('account_chunk_size', 65536))
        self.client_chunk_size = int(conf.get('client_chunk_size', 65536))
        self.log_headers = conf.get('log_headers') == 'True'
        self.error_suppression_interval = \
@@ -979,8 +976,6 @@ class BaseApplication(object):
        self.rate_limit_blacklist = [x.strip() for x in
            conf.get('rate_limit_account_blacklist', '').split(',')
            if x.strip()]
        self.container_put_lock_timeout = \
            int(conf.get('container_put_lock_timeout', 5))

    def get_controller(self, path):
        """