2010-07-30 14:57:20 -05:00
=====================
Administrator's Guide
=====================
------------------
Managing the Rings
------------------
2012-08-21 10:03:42 -07:00
You may build the storage rings on any server with the appropriate
version of Swift installed. Once built or changed (rebalanced), you
must distribute the rings to all the servers in the cluster. Storage
rings contain information about all the Swift storage partitions and
how they are distributed between the different nodes and disks.
Swift 1.6.0 is the last version to use a Python pickle format.
Subsequent versions use a different serialization format. **Rings
generated by Swift versions 1.6.0 and earlier may be read by any
version, but rings generated after 1.6.0 may only be read by Swift
versions greater than 1.6.0.** So when upgrading from version 1.6.0 or
earlier to a version greater than 1.6.0, either upgrade Swift on your
ring building server **last** after all Swift nodes have been successfully
upgraded, or refrain from generating rings until all Swift nodes have
been successfully upgraded.
If you need to downgrade from a version of swift greater than 1.6.0 to
a version less than or equal to 1.6.0, first downgrade your ring-building
server, generate new rings, push them out, then continue with the rest
of the downgrade.
For more information see :doc: `overview_ring` .
2010-11-30 12:24:55 -06:00
2010-07-30 14:57:20 -05:00
Removing a device from the ring::
swift-ring-builder <builder-file> remove <ip_address>/<device_name>
Removing a server from the ring::
swift-ring-builder <builder-file> remove <ip_address>
Adding devices to the ring:
See :ref: `ring-preparing`
See what devices for a server are in the ring::
swift-ring-builder <builder-file> search <ip_address>
Once you are done with all changes to the ring, the changes need to be
"committed"::
swift-ring-builder <builder-file> rebalance
Once the new rings are built, they should be pushed out to all the servers
in the cluster.
2013-01-25 08:00:33 -08:00
Optionally, if invoked as 'swift-ring-builder-safe' the directory containing
the specified builder file will be locked (via a .lock file in the parent
directory). This provides a basic safe guard against multiple instances
of the swift-ring-builder (or other utilities that observe this lock) from
attempting to write to or read the builder/ring files while operations are in
progress. This can be useful in environments where ring management has been
automated but the operator still needs to interact with the rings manually.
2010-11-30 14:15:41 -06:00
-----------------------
Scripting Ring Creation
-----------------------
You can create scripts to create the account and container rings and rebalance. Here's an example script for the Account ring. Use similar commands to create a make-container-ring.sh script on the proxy server node.
2010-11-30 12:24:55 -06:00
2012-04-10 12:25:01 -07:00
1. Create a script file called make-account-ring.sh on the proxy
server node with the following content::
2010-11-30 12:24:55 -06:00
#!/bin/bash
cd /etc/swift
rm -f account.builder account.ring.gz backups/account.builder backups/account.ring.gz
swift-ring-builder account.builder create 18 3 1
swift-ring-builder account.builder add z1-<account-server-1>:6002/sdb1 1
swift-ring-builder account.builder add z2-<account-server-2>:6002/sdb1 1
swift-ring-builder account.builder rebalance
2012-04-10 12:25:01 -07:00
You need to replace the values of <account-server-1>,
<account-server-2>, etc. with the IP addresses of the account
servers used in your setup. You can have as many account servers as
you need. All account servers are assumed to be listening on port
6002, and have a storage device called "sdb1" (this is a directory
name created under /drives when we setup the account server). The
"z1", "z2", etc. designate zones, and you can choose whether you
put devices in the same or different zones.
2010-11-30 12:24:55 -06:00
2. Make the script file executable and run it to create the account ring file::
chmod +x make-account-ring.sh
sudo ./make-account-ring.sh
2012-04-10 12:25:01 -07:00
3. Copy the resulting ring file /etc/swift/account.ring.gz to all the
account server nodes in your Swift environment, and put them in the
/etc/swift directory on these nodes. Make sure that every time you
change the account ring configuration, you copy the resulting ring
file to all the account nodes.
2010-11-30 12:24:55 -06:00
2010-07-30 14:57:20 -05:00
-----------------------
Handling System Updates
-----------------------
It is recommended that system updates and reboots are done a zone at a time.
This allows the update to happen, and for the Swift cluster to stay available
and responsive to requests. It is also advisable when updating a zone, let
it run for a while before updating the other zones to make sure the update
doesn't have any adverse effects.
----------------------
Handling Drive Failure
----------------------
In the event that a drive has failed, the first step is to make sure the drive
is unmounted. This will make it easier for swift to work around the failure
until it has been resolved. If the drive is going to be replaced immediately,
then it is just best to replace the drive, format it, remount it, and let
replication fill it up.
If the drive can't be replaced immediately, then it is best to leave it
unmounted, and remove the drive from the ring. This will allow all the
replicas that were on that drive to be replicated elsewhere until the drive
is replaced. Once the drive is replaced, it can be re-added to the ring.
-----------------------
Handling Server Failure
-----------------------
If a server is having hardware issues, it is a good idea to make sure the
swift services are not running. This will allow Swift to work around the
failure while you troubleshoot.
If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Swift work
around the failure and get the machine fixed and back online. When the
machine comes back online, replication will make sure that anything that is
missing during the downtime will get updated.
If the server has more serious issues, then it is probably best to remove
all of the server's devices from the ring. Once the server has been repaired
and is back online, the server's devices can be added back into the ring.
It is important that the devices are reformatted before putting them back
into the ring as it is likely to be responsible for a different set of
partitions than before.
-----------------------
Detecting Failed Drives
-----------------------
It has been our experience that when a drive is about to fail, error messages
will spew into `/var/log/kern.log` . There is a script called
`swift-drive-audit` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Swift can
work around it. The script takes a configuration file with the following
settings:
[drive-audit]
================== ========== ===========================================
Option Default Description
------------------ ---------- -------------------------------------------
log_facility LOG_LOCAL0 Syslog log facility
log_level INFO Log level
device_dir /srv/node Directory devices are mounted under
minutes 60 Number of minutes to look back in
`/var/log/kern.log`
error_limit 1 Number of errors to find before a device
is unmounted
================== ========== ===========================================
This script has only been tested on Ubuntu 10.04, so if you are using a
different distro or OS, some care should be taken before using in production.
--------------
Cluster Health
--------------
2011-03-31 22:32:41 +00:00
There is a swift-dispersion-report tool for measuring overall cluster health.
This is accomplished by checking if a set of deliberately distributed
containers and objects are currently in their proper places within the cluster.
2010-08-17 12:36:49 -07:00
For instance, a common deployment has three replicas of each object. The health
of that object can be measured by checking if each replica is in its proper
place. If only 2 of the 3 is in place the object's heath can be said to be at
66.66%, where 100% would be perfect.
A single object's health, especially an older object, usually reflects the
health of that entire partition the object is in. If we make enough objects on
a distinct percentage of the partitions in the cluster, we can get a pretty
valid estimate of the overall cluster health. In practice, about 1% partition
coverage seems to balance well between accuracy and the amount of time it takes
to gather results.
The first thing that needs to be done to provide this health value is create a
new account solely for this usage. Next, we need to place the containers and
objects throughout the system so that they are on distinct partitions. The
2011-03-31 22:32:41 +00:00
swift-dispersion-populate tool does this by making up random container and
object names until they fall on distinct partitions. Last, and repeatedly for
the life of the cluster, we need to run the swift-dispersion-report tool to
check the health of each of these containers and objects.
2010-08-17 12:36:49 -07:00
These tools need direct access to the entire cluster and to the ring files
2011-03-14 02:56:37 +00:00
(installing them on a proxy server will probably do). Both
2011-03-31 22:32:41 +00:00
swift-dispersion-populate and swift-dispersion-report use the same
configuration file, /etc/swift/dispersion.conf. Example conf file::
2010-08-17 12:36:49 -07:00
2011-10-05 16:54:56 +02:00
[dispersion]
2012-07-13 17:48:37 -05:00
auth_url = http://localhost:8080/auth/v1.0
2010-08-17 12:36:49 -07:00
auth_user = test:tester
auth_key = testing
2013-01-23 09:58:28 +00:00
endpoint_type = internalURL
2010-08-17 12:36:49 -07:00
There are also options for the conf file for specifying the dispersion coverage
2011-03-31 22:32:41 +00:00
(defaults to 1%), retries, concurrency, etc. though usually the defaults are
fine.
2010-08-17 12:36:49 -07:00
2011-03-31 22:32:41 +00:00
Once the configuration is in place, run `swift-dispersion-populate` to populate
2010-08-17 12:36:49 -07:00
the containers and objects throughout the cluster.
Now that those containers and objects are in place, you can run
2011-03-31 22:32:41 +00:00
`swift-dispersion-report` to get a dispersion report, or the overall health of
2010-08-17 12:36:49 -07:00
the cluster. Here is an example of a cluster in perfect health::
2011-03-31 22:32:41 +00:00
$ swift-dispersion-report
2010-08-17 12:36:49 -07:00
Queried 2621 containers for dispersion reporting, 19s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space
Queried 2619 objects for dispersion reporting, 7s, 0 retries
100.00% of object copies found (7857 of 7857)
Sample represents 1.00% of the object partition space
Now I'll deliberately double the weight of a device in the object ring (with
replication turned off) and rerun the dispersion report to show what impact
that has::
$ swift-ring-builder object.builder set_weight d0 200
$ swift-ring-builder object.builder rebalance
...
2011-03-31 22:32:41 +00:00
$ swift-dispersion-report
2010-08-17 12:36:49 -07:00
Queried 2621 containers for dispersion reporting, 8s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space
Queried 2619 objects for dispersion reporting, 7s, 0 retries
There were 1763 partitions missing one copy.
77.56% of object copies found (6094 of 7857)
Sample represents 1.00% of the object partition space
You can see the health of the objects in the cluster has gone down
significantly. Of course, I only have four devices in this test environment, in
a production environment with many many devices the impact of one device change
is much less. Next, I'll run the replicators to get everything put back into
place and then rerun the dispersion report::
... start object replicators and monitor logs until they're caught up ...
2011-03-31 22:32:41 +00:00
$ swift-dispersion-report
2010-08-17 12:36:49 -07:00
Queried 2621 containers for dispersion reporting, 17s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space
Queried 2619 objects for dispersion reporting, 7s, 0 retries
100.00% of object copies found (7857 of 7857)
Sample represents 1.00% of the object partition space
2012-12-03 16:12:10 -06:00
You can also run the report for only containers or objects::
$ swift-dispersion-report --container-only
Queried 2621 containers for dispersion reporting, 17s, 0 retries
100.00% of container copies found (7863 of 7863)
Sample represents 1.00% of the container partition space
$ swift-dispersion-report --object-only
Queried 2619 objects for dispersion reporting, 7s, 0 retries
100.00% of object copies found (7857 of 7857)
Sample represents 1.00% of the object partition space
2012-02-29 04:24:26 +00:00
Alternatively, the dispersion report can also be output in json format. This
allows it to be more easily consumed by third party utilities::
$ swift-dispersion-report -j
{"object": {"retries:": 0, "missing_two": 0, "copies_found": 7863, "missing_one": 0, "copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "missing_all": 0}, "container": {"retries:": 0, "missing_two": 0, "copies_found": 12534, "missing_one": 0, "copies_expected": 12534, "pct_found": 100.0, "overlapping": 15, "missing_all": 0}}
2010-07-30 14:57:20 -05:00
2011-10-18 21:10:50 +00:00
--------------------------------
Cluster Telemetry and Monitoring
--------------------------------
2012-05-14 18:01:48 -05:00
Various metrics and telemetry can be obtained from the account, container, and
object servers using the recon server middleware and the swift-recon cli. To do
so update your account, container, or object servers pipelines to include recon
and add the associated filter config.
object-server.conf sample::
2011-10-18 21:10:50 +00:00
[pipeline:main]
pipeline = recon object-server
2012-05-14 18:01:48 -05:00
2011-10-18 21:10:50 +00:00
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
2012-05-14 18:01:48 -05:00
container-server.conf sample::
2011-10-18 21:10:50 +00:00
2012-05-14 18:01:48 -05:00
[pipeline:main]
pipeline = recon container-server
2011-10-18 21:10:50 +00:00
2012-05-14 18:01:48 -05:00
[filter:recon]
use = egg:swift#recon
2011-10-18 21:10:50 +00:00
recon_cache_path = /var/cache/swift
2012-05-14 18:01:48 -05:00
account-server.conf sample::
[pipeline:main]
pipeline = recon account-server
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift
The recon_cache_path simply sets the directory where stats for a few items will
be stored. Depending on the method of deployment you may need to create this
directory manually and ensure that swift has read/write access.
Finally, if you also wish to track asynchronous pending on your object
servers you will need to setup a cronjob to run the swift-recon-cron script
periodically on your object servers::
2011-10-18 21:10:50 +00:00
*/5 * * * * swift /usr/bin/swift-recon-cron /etc/swift/object-server.conf
2012-05-14 18:01:48 -05:00
2013-03-17 15:58:06 -07:00
Once the recon middleware is enabled, a GET request for
"/recon/<metric>" to the backend object server will return a
JSON-formatted response::
2011-10-18 21:10:50 +00:00
fhines@ubuntu:~$ curl -i http://localhost:6030/recon/async
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 20
Date: Tue, 18 Oct 2011 21:03:01 GMT
{"async_pending": 0}
2013-03-17 15:58:06 -07:00
Note that the default port for the object server is 6000, except on a
Swift All-In-One installation, which uses 6010, 6020, 6030, and 6040.
2012-05-28 13:36:59 -05:00
The following metrics and telemetry are currently exposed:
2012-05-14 18:01:48 -05:00
2012-05-28 13:36:59 -05:00
========================= ========================================================================================
2012-05-14 18:01:48 -05:00
Request URI Description
2012-05-28 13:36:59 -05:00
------------------------- ----------------------------------------------------------------------------------------
2012-05-14 18:01:48 -05:00
/recon/load returns 1,5, and 15 minute load average
/recon/mem returns /proc/meminfo
/recon/mounted returns *ALL* currently mounted filesystems
/recon/unmounted returns all unmounted drives if mount_check = True
/recon/diskusage returns disk utilization for storage devices
/recon/ringmd5 returns object/container/account ring md5sums
/recon/quarantined returns # of quarantined objects/accounts/containers
/recon/sockstat returns consumable info from /proc/net/sockstat|6
/recon/devices returns list of devices and devices dir i.e. /srv/node
/recon/async returns count of async pending
2013-02-12 15:38:40 -08:00
/recon/replication returns object replication times (for backward compatibility)
2012-05-14 18:01:48 -05:00
/recon/replication/<type> returns replication info for given type (account, container, object)
/recon/auditor/<type> returns auditor stats on last reported scan for given type (account, container, object)
/recon/updater/<type> returns last updater sweep times for given type (container, object)
2012-05-28 13:36:59 -05:00
========================= ========================================================================================
2011-10-18 21:10:50 +00:00
This information can also be queried via the swift-recon command line utility::
fhines@ubuntu:~$ swift-recon -h
Usage:
2012-05-14 18:01:48 -05:00
usage: swift-recon <server_type> [-v] [--suppress] [-a] [-r] [-u] [-d]
[-l] [--md5] [--auditor] [--updater] [--expirer] [--sockstat]
<server_type> account|container|object
Defaults to object server.
ex: swift-recon container -l --auditor
2011-10-18 21:10:50 +00:00
Options:
-h, --help show this help message and exit
-v, --verbose Print verbose info
--suppress Suppress most connection related errors
-a, --async Get async stats
-r, --replication Get replication stats
2012-05-14 18:01:48 -05:00
--auditor Get auditor stats
--updater Get updater stats
--expirer Get expirer stats
2011-10-18 21:10:50 +00:00
-u, --unmounted Check cluster for unmounted devices
-d, --diskusage Get disk usage stats
-l, --loadstats Get cluster load average stats
-q, --quarantined Get cluster quarantine stats
2012-05-14 18:01:48 -05:00
--md5 Get md5sum of servers ring and compare to local copy
2011-11-15 17:55:14 +00:00
--sockstat Get cluster socket usage stats
2012-05-14 18:01:48 -05:00
--all Perform all checks. Equal to -arudlq --md5 --sockstat
2011-10-18 21:10:50 +00:00
-z ZONE, --zone=ZONE Only query servers in specified zone
2012-05-14 18:01:48 -05:00
-t SECONDS, --timeout=SECONDS
Time to wait for a response from a server
2011-10-18 21:10:50 +00:00
--swiftdir=SWIFTDIR Default = /etc/swift
2012-05-14 18:01:48 -05:00
For example, to obtain container replication info from all hosts in zone "3"::
2011-10-18 21:10:50 +00:00
2012-05-14 18:01:48 -05:00
fhines@ubuntu:~$ swift-recon container -r --zone 3
2011-10-18 21:10:50 +00:00
===============================================================================
2012-05-14 18:01:48 -05:00
--> Starting reconnaissance on 1 hosts
2011-10-18 21:10:50 +00:00
===============================================================================
2012-05-14 18:01:48 -05:00
[2012-04-02 02:45:48] Checking on replication
[failure] low: 0.000, high: 0.000, avg: 0.000, reported: 1
[success] low: 486.000, high: 486.000, avg: 486.000, reported: 1
[replication_time] low: 20.853, high: 20.853, avg: 20.853, reported: 1
[attempted] low: 243.000, high: 243.000, avg: 243.000, reported: 1
2011-10-18 21:10:50 +00:00
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
---------------------------
Reporting Metrics to StatsD
---------------------------
If you have a StatsD_ server running, Swift may be configured to send it
real-time operational metrics. To enable this, set the following
configuration entries (see the sample configuration files)::
log_statsd_host = localhost
log_statsd_port = 8125
2013-01-19 15:25:27 -08:00
log_statsd_default_sample_rate = 1.0
log_statsd_sample_rate_factor = 1.0
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
log_statsd_metric_prefix = [empty-string]
If `log_statsd_host` is not set, this feature is disabled. The default values
for the other settings are given above.
.. _StatsD: http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
.. _Graphite: http://graphite.wikidot.com/
.. _Ganglia: http://ganglia.sourceforge.net/
The sample rate is a real number between 0 and 1 which defines the
probability of sending a sample for any given event or timing measurement.
This sample rate is sent with each sample to StatsD and used to
multiply the value. For example, with a sample rate of 0.5, StatsD will
multiply that counter's value by 2 when flushing the metric to an upstream
2013-01-19 15:25:27 -08:00
monitoring system (Graphite_, Ganglia_, etc.).
Some relatively high-frequency metrics have a default sample rate less than
one. If you want to override the default sample rate for all metrics whose
default sample rate is not specified in the Swift source, you may set
`log_statsd_default_sample_rate` to a value less than one. This is NOT
recommended (see next paragraph). A better way to reduce StatsD load is to
adjust `log_statsd_sample_rate_factor` to a value less than one. The
`log_statsd_sample_rate_factor` is multiplied to any sample rate (either the
global default or one specified by the actual metric logging call in the Swift
source) prior to handling. In other words, this one tunable can lower the
frequency of all StatsD logging by a proportional amount.
To get the best data, start with the default `log_statsd_default_sample_rate`
and `log_statsd_sample_rate_factor` values of 1 and only lower
`log_statsd_sample_rate_factor` if needed. The
`log_statsd_default_sample_rate` should not be used and remains for backward
compatibility only.
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
The metric prefix will be prepended to every metric sent to the StatsD server
For example, with::
log_statsd_metric_prefix = proxy01
the metric `proxy-server.errors` would be sent to StatsD as
`proxy01.proxy-server.errors` . This is useful for differentiating different
servers when sending statistics to a central StatsD server. If you run a local
StatsD server per node, you could configure a per-node metrics prefix there and
leave `log_statsd_metric_prefix` blank.
2013-01-19 10:57:14 -08:00
Note that metrics reported to StatsD are counters or timing data (which are
sent in units of milliseconds). StatsD usually expands timing data out to min,
max, avg, count, and 90th percentile per timing metric, but the details of
this behavior will depend on the configuration of your StatsD server. Some
important "gauge" metrics may still need to be collected using another method.
For example, the `object-server.async_pendings` StatsD metric counts the generation
of async_pendings in real-time, but will not tell you the current number of
async_pending container updates on disk at any point in time.
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
Note also that the set of metrics collected, their names, and their semantics
are not locked down and will change over time. StatsD logging is currently in
a "beta" stage and will continue to evolve.
Metrics for `account-auditor` :
========================== =========================================================
Metric Name Description
-------------------------- ---------------------------------------------------------
`account-auditor.errors` Count of audit runs (across all account databases) which
caught an Exception.
`account-auditor.passes` Count of individual account databases which passed audit.
`account-auditor.failures` Count of individual account databases which failed audit.
`account-auditor.timing` Timing data for individual account database audits.
========================== =========================================================
Metrics for `account-reaper` :
============================================== ====================================================
Metric Name Description
---------------------------------------------- ----------------------------------------------------
`account-reaper.errors` Count of devices failing the mount check.
`account-reaper.timing` Timing data for each reap_account() call.
`account-reaper.return_codes.X` Count of HTTP return codes from various operations
2013-01-19 10:57:14 -08:00
(e.g. object listing, container deletion, etc.). The
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
value for X is the first digit of the return code
(2 for 201, 4 for 404, etc.).
`account-reaper.containers_failures` Count of failures to delete a container.
`account-reaper.containers_deleted` Count of containers successfully deleted.
`account-reaper.containers_remaining` Count of containers which failed to delete with
zero successes.
`account-reaper.containers_possibly_remaining` Count of containers which failed to delete with
at least one success.
`account-reaper.objects_failures` Count of failures to delete an object.
`account-reaper.objects_deleted` Count of objects successfully deleted.
`account-reaper.objects_remaining` Count of objects which failed to delete with zero
successes.
`account-reaper.objects_possibly_remaining` Count of objects which failed to delete with at
least one success.
============================================== ====================================================
Metrics for `account-server` ("Not Found" is not considered an error and requests
which increment `errors` are not included in the timing data):
2012-11-06 02:44:43 -08:00
======================================== =======================================================
Metric Name Description
---------------------------------------- -------------------------------------------------------
`account-server.DELETE.errors.timing` Timing data for each DELETE request resulting in an
error: bad request, not mounted, missing timestamp.
`account-server.DELETE.timing` Timing data for each DELETE request not resulting in
an error.
`account-server.PUT.errors.timing` Timing data for each PUT request resulting in an error:
bad request, not mounted, conflict, recently-deleted.
`account-server.PUT.timing` Timing data for each PUT request not resulting in an
error.
`account-server.HEAD.errors.timing` Timing data for each HEAD request resulting in an
error: bad request, not mounted.
`account-server.HEAD.timing` Timing data for each HEAD request not resulting in
an error.
`account-server.GET.errors.timing` Timing data for each GET request resulting in an
error: bad request, not mounted, bad delimiter,
account listing limit too high, bad accept header.
`account-server.GET.timing` Timing data for each GET request not resulting in
an error.
`account-server.REPLICATE.errors.timing` Timing data for each REPLICATE request resulting in an
error: bad request, not mounted.
`account-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting
in an error.
`account-server.POST.errors.timing` Timing data for each POST request resulting in an
error: bad request, bad or missing timestamp, not
mounted.
`account-server.POST.timing` Timing data for each POST request not resulting in
an error.
======================================== =======================================================
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
Metrics for `account-replicator` :
2012-08-17 17:00:50 -07:00
===================================== ====================================================
Metric Name Description
------------------------------------- ----------------------------------------------------
`account-replicator.diffs` Count of syncs handled by sending differing rows.
`account-replicator.diff_caps` Count of "diffs" operations which failed because
"max_diffs" was hit.
`account-replicator.no_changes` Count of accounts found to be in sync.
`account-replicator.hashmatches` Count of accounts found to be in sync via hash
comparison (`broker.merge_syncs` was called).
2012-10-18 14:49:46 -07:00
`account-replicator.rsyncs` Count of completely missing accounts which were sent
2012-08-17 17:00:50 -07:00
via rsync.
`account-replicator.remote_merges` Count of syncs handled by sending entire database
via rsync.
`account-replicator.attempts` Count of database replication attempts.
`account-replicator.failures` Count of database replication attempts which failed
due to corruption (quarantined) or inability to read
as well as attempts to individual nodes which
failed.
`account-replicator.removes.<device>` Count of databases on <device> deleted because the
delete_timestamp was greater than the put_timestamp
and the database had no rows or because it was
successfully sync'ed to other locations and doesn't
belong here anymore.
`account-replicator.successes` Count of replication attempts to an individual node
which were successful.
`account-replicator.timing` Timing data for each database replication attempt
not resulting in a failure.
===================================== ====================================================
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
Metrics for `container-auditor` :
============================ ====================================================
Metric Name Description
---------------------------- ----------------------------------------------------
`container-auditor.errors` Incremented when an Exception is caught in an audit
pass (only once per pass, max).
`container-auditor.passes` Count of individual containers passing an audit.
`container-auditor.failures` Count of individual containers failing an audit.
`container-auditor.timing` Timing data for each container audit.
============================ ====================================================
Metrics for `container-replicator` :
2012-08-17 17:00:50 -07:00
======================================= ====================================================
Metric Name Description
--------------------------------------- ----------------------------------------------------
`container-replicator.diffs` Count of syncs handled by sending differing rows.
`container-replicator.diff_caps` Count of "diffs" operations which failed because
"max_diffs" was hit.
`container-replicator.no_changes` Count of containers found to be in sync.
`container-replicator.hashmatches` Count of containers found to be in sync via hash
comparison (`broker.merge_syncs` was called).
`container-replicator.rsyncs` Count of completely missing containers where were sent
via rsync.
`container-replicator.remote_merges` Count of syncs handled by sending entire database
via rsync.
`container-replicator.attempts` Count of database replication attempts.
`container-replicator.failures` Count of database replication attempts which failed
due to corruption (quarantined) or inability to read
as well as attempts to individual nodes which
failed.
`container-replicator.removes.<device>` Count of databases deleted on <device> because the
delete_timestamp was greater than the put_timestamp
and the database had no rows or because it was
successfully sync'ed to other locations and doesn't
belong here anymore.
`container-replicator.successes` Count of replication attempts to an individual node
which were successful.
`container-replicator.timing` Timing data for each database replication attempt
not resulting in a failure.
======================================= ====================================================
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
Metrics for `container-server` ("Not Found" is not considered an error and requests
which increment `errors` are not included in the timing data):
2012-11-06 02:44:43 -08:00
========================================== ====================================================
Metric Name Description
------------------------------------------ ----------------------------------------------------
`container-server.DELETE.errors.timing` Timing data for DELETE request errors: bad request,
not mounted, missing timestamp, conflict.
`container-server.DELETE.timing` Timing data for each DELETE request not resulting in
an error.
`container-server.PUT.errors.timing` Timing data for PUT request errors: bad request,
missing timestamp, not mounted, conflict.
`container-server.PUT.timing` Timing data for each PUT request not resulting in an
error.
`container-server.HEAD.errors.timing` Timing data for HEAD request errors: bad request,
not mounted.
`container-server.HEAD.timing` Timing data for each HEAD request not resulting in
an error.
`container-server.GET.errors.timing` Timing data for GET request errors: bad request,
not mounted, parameters not utf8, bad accept header.
`container-server.GET.timing` Timing data for each GET request not resulting in
an error.
`container-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
request, not mounted.
`container-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting
in an error.
`container-server.POST.errors.timing` Timing data for POST request errors: bad request,
bad x-container-sync-to, not mounted.
`container-server.POST.timing` Timing data for each POST request not resulting in
an error.
========================================== ====================================================
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
Metrics for `container-sync` :
=============================== ====================================================
Metric Name Description
------------------------------- ----------------------------------------------------
`container-sync.skips` Count of containers skipped because they don't have
sync'ing enabled.
`container-sync.failures` Count of failures sync'ing of individual containers.
`container-sync.syncs` Count of individual containers sync'ed successfully.
`container-sync.deletes` Count of container database rows sync'ed by
deletion.
`container-sync.deletes.timing` Timing data for each container database row
sychronization via deletion.
`container-sync.puts` Count of container database rows sync'ed by PUTing.
`container-sync.puts.timing` Timing data for each container database row
2012-10-18 14:49:46 -07:00
synchronization via PUTing.
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
=============================== ====================================================
Metrics for `container-updater` :
============================== ====================================================
Metric Name Description
------------------------------ ----------------------------------------------------
`container-updater.successes` Count of containers which successfully updated their
account.
`container-updater.failures` Count of containers which failed to update their
account.
`container-updater.no_changes` Count of containers which didn't need to update
their account.
`container-updater.timing` Timing data for processing a container; only
includes timing for containers which needed to
update their accounts (i.e. "successes" and
"failures" but not "no_changes").
============================== ====================================================
Metrics for `object-auditor` :
============================ ====================================================
Metric Name Description
---------------------------- ----------------------------------------------------
`object-auditor.quarantines` Count of objects failing audit and quarantined.
`object-auditor.errors` Count of errors encountered while auditing objects.
`object-auditor.timing` Timing data for each object audit (does not include
any rate-limiting sleep time for
max_files_per_second, but does include rate-limiting
sleep time for max_bytes_per_second).
============================ ====================================================
Metrics for `object-expirer` :
======================== ====================================================
Metric Name Description
------------------------ ----------------------------------------------------
`object-expirer.objects` Count of objects expired.
`object-expirer.errors` Count of errors encountered while attempting to
expire an object.
`object-expirer.timing` Timing data for each object expiration attempt,
including ones resulting in an error.
======================== ====================================================
Metrics for `object-replicator` :
=================================================== ====================================================
Metric Name Description
--------------------------------------------------- ----------------------------------------------------
`object-replicator.partition.delete.count.<device>` A count of partitions on <device> which were
replicated to another node because they didn't
belong on this node. This metric is tracked
per-device to allow for "quiescence detection" for
object replication activity on each device.
`object-replicator.partition.delete.timing` Timing data for partitions replicated to another
node because they didn't belong on this node. This
metric is not tracked per device.
`object-replicator.partition.update.count.<device>` A count of partitions on <device> which were
replicated to another node, but also belong on this
node. As with delete.count, this metric is tracked
per-device.
`object-replicator.partition.update.timing` Timing data for partitions replicated which also
belong on this node. This metric is not tracked
per-device.
2012-10-18 14:49:46 -07:00
`object-replicator.suffix.hashes` Count of suffix directories whose hash (of filenames)
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
was recalculated.
`object-replicator.suffix.syncs` Count of suffix directories replicated with rsync.
=================================================== ====================================================
Metrics for `object-server` :
2012-11-06 02:44:43 -08:00
======================================= ====================================================
Metric Name Description
--------------------------------------- ----------------------------------------------------
`object-server.quarantines` Count of objects (files) found bad and moved to
quarantine.
`object-server.async_pendings` Count of container updates saved as async_pendings
(may result from PUT or DELETE requests).
`object-server.POST.errors.timing` Timing data for POST request errors: bad request,
missing timestamp, delete-at in past, not mounted.
`object-server.POST.timing` Timing data for each POST request not resulting in
an error.
`object-server.PUT.errors.timing` Timing data for PUT request errors: bad request,
not mounted, missing timestamp, object creation
constraint violation, delete-at in past.
`object-server.PUT.timeouts` Count of object PUTs which exceeded max_upload_time.
`object-server.PUT.timing` Timing data for each PUT request not resulting in an
error.
2013-02-24 00:02:32 -08:00
`object-server.PUT.<device>.timing` Timing data per kB transfered (ms/kB) for each
non-zero-byte PUT request on each device.
Monitoring problematic devices, higher is bad.
2012-11-06 02:44:43 -08:00
`object-server.GET.errors.timing` Timing data for GET request errors: bad request,
not mounted, header timestamps before the epoch,
precondition failed.
File errors resulting in a quarantine are not
counted here.
`object-server.GET.timing` Timing data for each GET request not resulting in an
error. Includes requests which couldn't find the
object (including disk errors resulting in file
quarantine).
`object-server.HEAD.errors.timing` Timing data for HEAD request errors: bad request,
not mounted.
`object-server.HEAD.timing` Timing data for each HEAD request not resulting in
an error. Includes requests which couldn't find the
object (including disk errors resulting in file
quarantine).
`object-server.DELETE.errors.timing` Timing data for DELETE request errors: bad request,
missing timestamp, not mounted, precondition
failed. Includes requests which couldn't find or
match the object.
`object-server.DELETE.timing` Timing data for each DELETE request not resulting
in an error.
`object-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
request, not mounted.
`object-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting
in an error.
======================================= ====================================================
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
Metrics for `object-updater` :
============================ ====================================================
Metric Name Description
---------------------------- ----------------------------------------------------
`object-updater.errors` Count of drives not mounted or async_pending files
with an unexpected name.
`object-updater.timing` Timing data for object sweeps to flush async_pending
container updates. Does not include object sweeps
which did not find an existing async_pending storage
directory.
`object-updater.quarantines` Count of async_pending container updates which were
corrupted and moved to quarantine.
`object-updater.successes` Count of successful container updates.
2012-10-18 14:49:46 -07:00
`object-updater.failures` Count of failed container updates.
`object-updater.unlinks` Count of async_pending files unlinked. An
async_pending file is unlinked either when it is
successfully processed or when the replicator sees
that there is a newer async_pending file for the
same object.
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
============================ ====================================================
Upating proxy-server StatsD logging.
Removed many StatsD logging calls in proxy-server and added
swift-informant-style catch-all logging in the proxy-logger middleware.
Many errors previously rolled into the "proxy-server.<type>.errors"
counter will now appear broken down by response code and with timing
data at: "proxy-server.<type>.<verb>.<status>.timing". Also, bytes
transferred (sum of in + out) will be at:
"proxy-server.<type>.<verb>.<status>.xfer". The proxy-logging
middleware can get its StatsD config from standard vars in [DEFAULT] or
from access_log_statsd_* config vars in its config section.
Similarly to Swift Informant, request methods ("verbs") are filtered
using the new proxy-logging config var, "log_statsd_valid_http_methods"
which defaults to GET, HEAD, POST, PUT, DELETE, and COPY. Requests with
methods not in this list use "BAD_METHOD" for <verb> in the metric name.
To avoid user error, access_log_statsd_valid_http_methods is also
accepted.
Previously, proxy-server metrics used "Account", "Container", and
"Object" for the <type>, but these are now all lowercase.
Updated the admin guide's StatsD docs to reflect the above changes and
also include the "proxy-server.<type>.handoff_count" and
"proxy-server.<type>.handoff_all_count" metrics.
The proxy server now saves off the original req.method and proxy_logging
will use this if it can (both for request logging and as the "<verb>" in
the statsd timing metric). This fixes bug 1025433.
Removed some stale access_log_* related code in proxy/server.py. Also
removed the BaseApplication/Application distinction as it's no longer
necessary.
Fixed up the sample config files a bit (logging lines, mostly).
Fixed typo in SAIO development guide.
Got proxy_logging.py test coverage to 100%.
Fixed proxy_logging.py for PEP8 v1.3.2.
Enhanced test.unit.FakeLogger to track more calls to enable testing
StatsD metric calls.
Change-Id: I45d94cb76450be96d66fcfab56359bdfdc3a2576
2012-08-19 17:44:43 -07:00
Metrics for `proxy-server` (in the table, `<type>` is the proxy-server
controller responsible for the request and will be one of "account",
"container", or "object"):
======================================== ====================================================
Metric Name Description
---------------------------------------- ----------------------------------------------------
`proxy-server.errors` Count of errors encountered while serving requests
before the controller type is determined. Includes
invalid Content-Length, errors finding the internal
controller to handle the request, invalid utf8, and
bad URLs.
`proxy-server.<type>.handoff_count` Count of node hand-offs; only tracked if log_handoffs
is set in the proxy-server config.
`proxy-server.<type>.handoff_all_count` Count of times *only* hand-off locations were
utilized; only tracked if log_handoffs is set in the
proxy-server config.
`proxy-server.<type>.client_timeouts` Count of client timeouts (client did not read within
`client_timeout` seconds during a GET or did not
supply data within `client_timeout` seconds during
a PUT).
`proxy-server.<type>.client_disconnects` Count of detected client disconnects during PUT
operations (does NOT include caught Exceptions in
the proxy-server which caused a client disconnect).
======================================== ====================================================
Metrics for `proxy-logging` middleware (in the table, `<type>` is either the
proxy-server controller responsible for the request: "account", "container",
"object", or the string "SOS" if the request came from the `Swift Origin Server`_
middleware. The `<verb>` portion will be one of "GET", "HEAD", "POST", "PUT",
2012-11-06 15:13:01 -08:00
"DELETE", "COPY", "OPTIONS", or "BAD_METHOD". The list of valid HTTP methods
is configurable via the `log_statsd_valid_http_methods` config variable and
the default setting yields the above behavior.
Upating proxy-server StatsD logging.
Removed many StatsD logging calls in proxy-server and added
swift-informant-style catch-all logging in the proxy-logger middleware.
Many errors previously rolled into the "proxy-server.<type>.errors"
counter will now appear broken down by response code and with timing
data at: "proxy-server.<type>.<verb>.<status>.timing". Also, bytes
transferred (sum of in + out) will be at:
"proxy-server.<type>.<verb>.<status>.xfer". The proxy-logging
middleware can get its StatsD config from standard vars in [DEFAULT] or
from access_log_statsd_* config vars in its config section.
Similarly to Swift Informant, request methods ("verbs") are filtered
using the new proxy-logging config var, "log_statsd_valid_http_methods"
which defaults to GET, HEAD, POST, PUT, DELETE, and COPY. Requests with
methods not in this list use "BAD_METHOD" for <verb> in the metric name.
To avoid user error, access_log_statsd_valid_http_methods is also
accepted.
Previously, proxy-server metrics used "Account", "Container", and
"Object" for the <type>, but these are now all lowercase.
Updated the admin guide's StatsD docs to reflect the above changes and
also include the "proxy-server.<type>.handoff_count" and
"proxy-server.<type>.handoff_all_count" metrics.
The proxy server now saves off the original req.method and proxy_logging
will use this if it can (both for request logging and as the "<verb>" in
the statsd timing metric). This fixes bug 1025433.
Removed some stale access_log_* related code in proxy/server.py. Also
removed the BaseApplication/Application distinction as it's no longer
necessary.
Fixed up the sample config files a bit (logging lines, mostly).
Fixed typo in SAIO development guide.
Got proxy_logging.py test coverage to 100%.
Fixed proxy_logging.py for PEP8 v1.3.2.
Enhanced test.unit.FakeLogger to track more calls to enable testing
StatsD metric calls.
Change-Id: I45d94cb76450be96d66fcfab56359bdfdc3a2576
2012-08-19 17:44:43 -07:00
.. _Swift Origin Server: https://github.com/dpgoetz/sos
2013-01-19 10:57:14 -08:00
==================================================== ============================================
Metric Name Description
---------------------------------------------------- --------------------------------------------
`proxy-server.<type>.<verb>.<status>.timing` Timing data for requests, start to finish.
The <status> portion is the numeric HTTP
status code for the request (e.g. "200" or
"404").
`proxy-server.<type>.GET.<status>.first-byte.timing` Timing data up to completion of sending the
response headers (only for GET requests).
<status> and <type> are as for the main
timing metric.
`proxy-server.<type>.<verb>.<status>.xfer` This counter metric is the sum of bytes
transferred in (from clients) and out (to
clients) for requests. The <type>, <verb>,
and <status> portions of the metric are just
like the main timing metric.
==================================================== ============================================
Upating proxy-server StatsD logging.
Removed many StatsD logging calls in proxy-server and added
swift-informant-style catch-all logging in the proxy-logger middleware.
Many errors previously rolled into the "proxy-server.<type>.errors"
counter will now appear broken down by response code and with timing
data at: "proxy-server.<type>.<verb>.<status>.timing". Also, bytes
transferred (sum of in + out) will be at:
"proxy-server.<type>.<verb>.<status>.xfer". The proxy-logging
middleware can get its StatsD config from standard vars in [DEFAULT] or
from access_log_statsd_* config vars in its config section.
Similarly to Swift Informant, request methods ("verbs") are filtered
using the new proxy-logging config var, "log_statsd_valid_http_methods"
which defaults to GET, HEAD, POST, PUT, DELETE, and COPY. Requests with
methods not in this list use "BAD_METHOD" for <verb> in the metric name.
To avoid user error, access_log_statsd_valid_http_methods is also
accepted.
Previously, proxy-server metrics used "Account", "Container", and
"Object" for the <type>, but these are now all lowercase.
Updated the admin guide's StatsD docs to reflect the above changes and
also include the "proxy-server.<type>.handoff_count" and
"proxy-server.<type>.handoff_all_count" metrics.
The proxy server now saves off the original req.method and proxy_logging
will use this if it can (both for request logging and as the "<verb>" in
the statsd timing metric). This fixes bug 1025433.
Removed some stale access_log_* related code in proxy/server.py. Also
removed the BaseApplication/Application distinction as it's no longer
necessary.
Fixed up the sample config files a bit (logging lines, mostly).
Fixed typo in SAIO development guide.
Got proxy_logging.py test coverage to 100%.
Fixed proxy_logging.py for PEP8 v1.3.2.
Enhanced test.unit.FakeLogger to track more calls to enable testing
StatsD metric calls.
Change-Id: I45d94cb76450be96d66fcfab56359bdfdc3a2576
2012-08-19 17:44:43 -07:00
Metrics for `tempauth` middleware (in the table, `<reseller_prefix>` represents
the actual configured reseller_prefix or "`NONE` " if the reseller_prefix is the
empty string):
Adding StatsD logging to Swift.
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus only
sending metrics to StatsD in middleware. It's the only way to report
some internal activities of Swift in a real-time manner. So to have one
way of reporting to StatsD and one place/style of configuration, even
some things (like, say, timing of PUT requests into the proxy-server)
which could be logged via middleware are consistently logged the same
way (deep integration via the logger delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and timing_since() method). So I wrote my
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
========================================= ====================================================
Metric Name Description
----------------------------------------- ----------------------------------------------------
`tempauth.<reseller_prefix>.unauthorized` Count of regular requests which were denied with
HTTPUnauthorized.
`tempauth.<reseller_prefix>.forbidden` Count of regular requests which were denied with
HTTPForbidden.
`tempauth.<reseller_prefix>.token_denied` Count of token requests which were denied.
`tempauth.<reseller_prefix>.errors` Count of errors.
========================================= ====================================================
2010-07-30 14:57:20 -05:00
------------------------
Debugging Tips and Tools
------------------------
When a request is made to Swift, it is given a unique transaction id. This
id should be in every log line that has to do with that request. This can
2010-10-13 11:28:27 -05:00
be useful when looking at all the services that are hit by a single request.
2010-07-30 14:57:20 -05:00
If you need to know where a specific account, container or object is in the
cluster, `swift-get-nodes` will show the location where each replica should be.
If you are looking at an object on the server and need more info,
`swift-object-info` will display the account, container, replica locations
and metadata of the object.
If you want to audit the data for an account, `swift-account-audit` can be
used to crawl the account, checking that all containers and objects can be
found.
-----------------
Managing Services
-----------------
Swift services are generally managed with `swift-init` . the general usage is
`` swift-init <service> <command> `` , where service is the swift service to
manage (for example object, container, account, proxy) and command is one of:
========== ===============================================
Command Description
---------- -----------------------------------------------
start Start the service
stop Stop the service
restart Restart the service
shutdown Attempt to gracefully shutdown the service
reload Attempt to gracefully restart the service
========== ===============================================
A graceful shutdown or reload will finish any current requests before
completely stopping the old service. There is also a special case of
`swift-init all <command>` , which will run the command for all swift services.
2011-02-14 20:25:40 +00:00
--------------
Object Auditor
--------------
On system failures, the XFS file system can sometimes truncate files it's
2012-10-18 14:49:46 -07:00
trying to write and produce zero-byte files. The object-auditor will catch
2011-02-14 20:25:40 +00:00
these problems but in the case of a system crash it would be advisable to run
an extra, less rate limited sweep to check for these specific files. You can
run this command as follows:
2011-02-21 16:37:12 -08:00
`swift-object-auditor /path/to/object-server/config/file.conf once -z 1000`
"-z" means to only check for zero-byte files at 1000 files per second.
2012-09-28 12:24:15 -07:00
-----------------
Object Replicator
-----------------
At times it is useful to be able to run the object replicator on a specific
device or partition. You can run the object-replicator as follows:
swift-object-replicator /path/to/object-server/config/file.conf once --devices=sda,sdb
This will run the object replicator on only the sda and sdb devices. You can
likewise run that command with --partitions. Both params accept a comma
separated list of values. If both are specified they will be ANDed together.
These can only be run in "once" mode.
2011-12-03 07:42:08 +00:00
-------------
Swift Orphans
-------------
Swift Orphans are processes left over after a reload of a Swift server.
2012-04-10 12:25:01 -07:00
For example, when upgrading a proxy server you would probaby finish
with a `swift-init proxy-server reload` or `/etc/init.d/swift-proxy
reload`. This kills the parent proxy server process and leaves the
child processes running to finish processing whatever requests they
might be handling at the time. It then starts up a new parent proxy
server process and its children to handle new incoming requests. This
allows zero-downtime upgrades with no impact to existing requests.
The orphaned child processes may take a while to exit, depending on
the length of the requests they were handling. However, sometimes an
old process can be hung up due to some bug or hardware issue. In these
cases, these orphaned processes will hang around
forever. `swift-orphans` can be used to find and kill these orphans.
`swift-orphans` with no arguments will just list the orphans it finds
that were started more than 24 hours ago. You shouldn't really check
for orphans until 24 hours after you perform a reload, as some
requests can take a long time to process. `swift-orphans -k TERM` will
send the SIG_TERM signal to the orphans processes, or you can `kill
-TERM` the pids yourself if you prefer.
2011-12-03 07:42:08 +00:00
You can run `swift-orphans --help` for more options.
------------
Swift Oldies
------------
2012-04-10 12:25:01 -07:00
Swift Oldies are processes that have just been around for a long
time. There's nothing necessarily wrong with this, but it might
indicate a hung process if you regularly upgrade and reload/restart
services. You might have so many servers that you don't notice when a
2012-10-18 14:49:46 -07:00
reload/restart fails; `swift-oldies` can help with this.
2012-04-10 12:25:01 -07:00
For example, if you upgraded and reloaded/restarted everything 2 days
ago, and you've already cleaned up any orphans with `swift-orphans` ,
you can run `swift-oldies -a 48` to find any Swift processes still
around that were started more than 2 days ago and then investigate
them accordingly.
2012-10-26 14:56:10 -05:00
-------------------
Custom Log Handlers
-------------------
Swift supports setting up custom log handlers for services by specifying a
comma-separated list of functions to invoke when logging is setup. It does so
via the `log_custom_handlers` configuration option. Logger hooks invoked are
passed the same arguments as Swift's get_logger function (as well as the
getLogger and LogAdapter object):
============== ===============================================
Name Description
-------------- -----------------------------------------------
conf Configuration dict to read settings from
name Name of the logger received
log_to_console (optional) Write log messages to console on stderr
log_route Route for the logging received
fmt Override log format received
logger The logging.getLogger object
adapted_logger The LogAdapter object
============== ===============================================
A basic example that sets up a custom logger might look like the
following:
.. code-block :: python
def my_logger(conf, name, log_to_console, log_route, fmt, logger,
adapted_logger):
my_conf_opt = conf.get('some_custom_setting')
my_handler = third_party_logstore_handler(my_conf_opt)
logger.addHandler(my_handler)
See :ref: `custom-logger-hooks-label` for sample use cases.