Added region prefix to example commands for adding devices to the ring.
Also updated the description to include the region prefix.
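For example, adding a device in region 1, zone 1 now looks like this
(IP, port, device, and weight values are illustrative):
$ swift-ring-builder account.builder add r1z1-127.0.0.1:6012/sdb1 100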
Change-Id: Ie6d6485b497cea973e37909b5b19b44946c8aa89
Currently, statistics are organized by command. However, it would also be
useful to display statistics organized by policy. Different policies may be
based on different storage properties (e.g., faster disks).
With this change, all the statistics for object timers will be sent per policy
as well.
Policy statistics reporting uses the policy index, and the metric names in
Graphite show up as proxy-server.object.policy.<policy-index>.<verb>, etc.
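For example, GETs against a policy with index 1 would report under
proxy-server.object.policy.1.GET..., alongside the existing aggregate
proxy-server.object.GET... metrics (the exact suffixes follow the same
conventions as the existing object timers).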
Updated unit tests for per-policy stat reporting and added new unit tests for
invalid cases.
Updated documentation in the Administrator's Guide to reflect this new
aggregation.
Change-Id: Id70491e4833791a3fb8ff385953d69018514cd9c
This patch makes the count of object replication failures available in
recon. Also, "failure_nodes" is added to the Account Replicator and
Container Replicator.
Recon shows the count of object replication failures as follows:
$ curl http://<ip>:<port>/recon/replication/object
{
    "replication_last": 1416334368.60865,
    "replication_stats": {
        "attempted": 13346,
        "failure": 870,
        "failure_nodes": {
            "192.168.0.1": {"sdb1": 3},
            "192.168.0.2": {"sdb1": 851, "sdc1": 1, "sdd1": 8},
            "192.168.0.3": {"sdb1": 3, "sdc1": 4}
        },
        "hashmatch": 0,
        "remove": 0,
        "rsync": 0,
        "start": 1416354240.9761429,
        "success": 1908
    },
    "replication_time": 2316.5563162644703,
    "object_replication_last": 1416334368.60865,
    "object_replication_time": 2316.5563162644703
}
Note that 'object_replication_last' and 'object_replication_time' are
considered transitional and will be removed in subsequent releases.
Use 'replication_last' and 'replication_time' instead.
Additionally, this patch adds the counts to swift-recon, and they will
be shown as follows:
$ swift-recon object -r
===============================================================================
--> Starting reconnaissance on 4 hosts
===============================================================================
[2014-11-27 16:14:09] Checking on replication
[replication_failure] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%,
no_result: 0, reported: 4
[replication_success] low: 3, high: 3, avg: 3.0, total: 12,
Failed: 0.0%, no_result: 0, reported: 4
[replication_time] low: 0, high: 0, avg: 0.0, total: 0, Failed: 0.0%,
no_result: 0, reported: 4
[replication_attempted] low: 1, high: 1, avg: 1.0, total: 4,
Failed: 0.0%, no_result: 0, reported: 4
Oldest completion was 2014-11-27 16:09:45 (4 minutes ago) by
192.168.0.4:6002.
Most recent completion was 2014-11-27 16:14:19 (-10 seconds ago) by
192.168.0.1:6002.
===============================================================================
In a cluster where one server runs with this patch and the other
servers run without it, executing swift-recon against the patched
server produces entries in the output, such as [failure], [success]
and [attempted], that carry no useful data, because the unpatched
servers cannot send responses with the information this patch needs.
Therefore, once you apply this patch to one server, apply it to the
other servers as well before you execute swift-recon.
DocImpact
Change-Id: Iecd33655ae2568482833131f422679996c374d78
Co-Authored-By: Kenichiro Matsuda <matsuda_kenichi@jp.fujitsu.com>
Co-Authored-By: Brian Cline <bcline@softlayer.com>
Implements: blueprint enable-object-replication-failure-in-recon
When rsync pushes to a remote node with an unmounted drive, and
certain precautions have not been taken, rsync may write files to
the local drive at the location where the drive was mounted.
There are two suggested solutions for this issue:
1) Set the permissions for all mount points in /srv/node/
to root:root 755
2) Mount the drives elsewhere and symlink the drives to /srv/.../
The first method ensures that only root, and not the swift user,
can write to the /srv/.../ directories.
With the second method, rsync encounters a broken link if it
attempts to write to an unmounted drive.
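A sketch of the first method, assuming drives are mounted under
/srv/node/ and are unmounted when the commands run:
$ sudo chown root:root /srv/node/*
$ sudo chmod 755 /srv/node/*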
Change-Id: I60ce4ed9ef8401768d5f78b6806cbb2e2a65303e
Closes-Bug: #1470576
This provides the capability to specify a project_name,
project_domain_name and user_domain_name in /etc/swift/dispersion.conf.
If these values are set in dispersion.conf, they are passed on to the
Swift client. With this it is possible to use a specific dispersion
project which is not in the Keystone default domain. Changes
were applied to swift-dispersion-populate and swift-dispersion-report.
Relevant man pages, the example dispersion.conf and the admin guide were
updated accordingly.
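A hedged sketch of dispersion.conf with the new options (auth values
are illustrative):
[dispersion]
auth_url = http://localhost:5000/v3/
auth_user = dispersion
auth_key = dispersion_password
auth_version = 3
project_name = dispersion
project_domain_name = default
user_domain_name = default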
DocImpact
Closes-Bug: #1468374
Change-Id: I0e716f8d281b4d0f510bc568bcee4a13fc480ff7
This change adds the current call time to the recon middleware and a
--time option to the recon CLI. This is useful for checking whether
time is synchronized across the cluster.
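For example, to check clock synchronization across the cluster:
$ swift-recon --time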
Change-Id: I62373e681f64d0bd71f4aeb287953dd3b2ea5662
swift-dispersion-populate and swift-dispersion-report don't work for
anything other than policy 0. Updated to allow the user to specify a
policy name on the command line (as with object-info), which makes
populate/report work with 3x, 2x, or EC-style policies.
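For example, assuming a policy named "silver" is defined in swift.conf
(option name assumed to match object-info's --policy-name):
$ swift-dispersion-populate --policy-name silver
$ swift-dispersion-report --policy-name silver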
Change-Id: Ib7c298f0f6d666b1ecca25315b88539f45cf9f95
Closes-Bug: 1458688
Sometimes, I get handed a builder file in a support ticket and a
question of the form "why is the balance [not] doing $thing?". When
that happens, I add a bunch of print statements to my local
swift/common/ring/builder.py, figure things out, and then delete the
print statements. This time, instead of deleting the print statements,
I turned them into debug() calls and added a "--debug" flag to the
rebalance command in hopes that someone else will find it useful.
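For example:
$ swift-ring-builder object.builder rebalance --debug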
Change-Id: I697af90984fa5b314ddf570280b4585ba0ba363c
Simply replacing a failed disk requires a very long time if the ring is not
changed, because all data will be replicated to a single new disk. This extends
the time to recover from missing replicas, and becomes even more important with
bigger disks.
This patch updates the doc to include a faster alternative by setting the weight
of a failed disk to 0. In this case the partitions from the failed disk are
distributed and replicated to the remaining disks in the cluster, and because
each disk gets only a fraction of the partitions it's also much faster.
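A sketch of the documented alternative, assuming the failed disk has
device id 42 in the object ring:
$ swift-ring-builder object.builder set_weight d42 0
$ swift-ring-builder object.builder rebalance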
Change-Id: I16617756359771ad89ca5d4690b58a014f481d9b
Add overview and example information for using Storage Policies.
DocImpact
Implements: blueprint storage-policies
Change-Id: I6f11f7a1bdaa6f3defb3baa56a820050e5f727f1
This allows an easier and more explicit way to tell swift-init to run on
specific servers. For example with an SAIO, this allows you to do
something like:
swift-init object-server.1 reload
to reload just the 1st object server. A more real world example is when
you are running separate servers for replication. In this example you
might have an object-server/public.conf and
object-server/replication.conf. With this change you can do something
like:
swift-init object-server.replication reload
to just reload the replication server.
DocImpact
Change-Id: I5c6046b5ee28e17dadfc5fc53d1d872d9bb8fe48
This is a very simple Swift tool to retrieve information
about an account that is located on a storage node.
One can call the tool with a given account db file
as it is stored on the storage node system.
It will then return various details about that account.
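A hedged usage sketch with a hypothetical db path:
$ swift-account-info /srv/node/sdb1/accounts/<partition>/<suffix>/<hash>/<hash>.db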
Change-Id: Ibfeee790adc000fc177b4b3c03d22ff785fda325
This is a very simple Swift tool to retrieve information
about a container that is located on a storage node.
One can call the tool with a given container db file
as it is stored on the storage node system.
It will then return various details about that container.
Change-Id: Ifebaed6c51a9ed5fbc0e7572bb43ef05d7dd254b
In object audit "once" mode we are allowing the user to specify
a sub-set of devices to audit using the "--devices" command-line
option. The sub-set is specified as a comma-separated list. This
patch is taken from a larger patch to enable parallel processing
in the object auditor.
We've had to modify recon so that it will work properly with this
change to "once" mode. We've modified dump_recon_cache()
so that it will store nested dictionaries, in other words it will
store a recon cache entry such as {'key1': {'key2': {...}}}. When
the object auditor is run in "once" mode with "--devices" set the
object_auditor_stats_ALL and ZBF entries look like:
{'object_auditor_stats_ALL': {'disk1disk2..diskn': {...}}}. When
swift-recon is run, it hunts through the nested dicts to find the
appropriate entries. The object auditor recon cache entries are set
to {} at the beginning of each audit cycle, and individual disk
entries are cleared from cache at the end of each disk's audit cycle.
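A hedged invocation sketch (device names illustrative):
$ swift-object-auditor /etc/swift/object-server.conf once --devices=sda,sdb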
DocImpact
Change-Id: Icc53dac0a8136f1b2f61d5e08baf7b4fd87c8123
Make it possible to override the default set of regexes
used to search for device block errors in the log file. Also make
the log file naming pattern configurable by setting it in the
drive-audit.conf file.
Update the "Detecting Failed Drives" section of the admin guide as well.
Change-Id: I7bd3acffed196da3e09db4c9dcbb48a20bdd1cf0
The proxy can now be configured to prefer local object servers for PUT
requests, where "local" is governed by the "write_affinity" setting. The
"write_affinity_node_count" setting controls how many local object
servers to try before giving up and going on to remote ones.
I chose to simply re-order the object servers instead of filtering out
nonlocal ones so that, if all of the local ones are down, clients can
still get successful responses (just slower).
The goal is to trade availability for throughput. By writing to local
object servers across fast LAN links, clients get better throughput
than if the object servers were far away over slow WAN links. The
downside, of course, is that data availability (not durability) may
suffer when drives fail.
The default configuration has no write affinity in it, so the default
behavior is unchanged.
Added some words about these settings to the admin guide.
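A hedged proxy-server.conf sketch (values illustrative):
[app:proxy-server]
write_affinity = r1
write_affinity_node_count = 2 * replicas
This prefers object servers in region 1 for PUTs and tries twice the
replica count of local nodes before falling back to remote ones.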
DocImpact
Change-Id: I09a0bd00524544ff627a3bccdcdc48f40720a86e
Fixes bug 1104708
There could be a severe performance drop for Swift if one disk of one
storage node is problematic, due to the tragic state of async disk I/O.
This patch provides PUT timing per kB transferred (ms/kB) monitoring
support for each non-zero-byte request of each disk and reports to
StatsD for alerting.
- adds "object-server.PUT.<device>.timing" metrics for the object-server.
DocImpact
Change-Id: Ie94bddad28e8be52e71683bf6c9db988664abe47
This was an outstanding TODO for StatsD Swift metrics. A new timing
metric is tracked for (only) GET requests for accounts, containers,
and objects:
proxy-server.<req_type>.GET.<status_int>.first-byte.timing
Also updated StatsD documentation in the Admin Guide to clarify that
timing metrics are sent in units of milliseconds.
Change-Id: I5bb781c06cefcb5280f4fb1112a526c029fe0c20
If invoked as 'swift-ring-builder-safe' the directory containing the builder
file provided will be locked (via lock_parent_directory()). This provides a
small safeguard against multiple instances of the swift-ring-builder (or
other utilities that observe this lock) from attempting to write to or read
the builder/ring files while operations are in progress.
This is particularly useful in environments where ring management has been
automated (via Chef or custom solutions) but the operator still occasionally
needs to manually interact with the ring.
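Invocation is identical to swift-ring-builder; for example:
$ swift-ring-builder-safe object.builder rebalance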
DocImpact
Change-Id: Ia362744a8151a91bfb586d01da582906726852e6
As Dieter pointed out in bug 1090495
(https://bugs.launchpad.net/swift/+bug/1090495), the volume of data
points can vary wildly from one StatsD metric to another.
This patch implements a partial solution by reducing the sample_rate
used for known high-volume metrics (operational experience will need to
inform this over time) and introducing a new tunable,
log_statsd_sample_rate_factor which is multiplied by the sample_rate for
every statsd stat. This tunable can be used to reduce StatsD traffic
proportionally for all metrics and is intended to replace
log_statsd_default_sample_rate, which is left alone for
backward-compatibility, should anyone be using it.
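A hedged configuration sketch (values illustrative):
[DEFAULT]
log_statsd_host = localhost
log_statsd_port = 8125
log_statsd_sample_rate_factor = 0.5
With a factor of 0.5, a metric whose sample_rate is 1 is sent to StatsD
roughly half the time.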
This patch also includes a drive-by fix for log_udp_port which wasn't
being converted to an int (I didn't verify that actually causes trouble
in SysLogHandler(), but it's definitely an improvement regardless).
Change-Id: Id404636e3629f6431cf1c4e64a143959750a3c23
- Add two optional flags that let you limit swift-dispersion-report to only
reporting on containers OR objects.
- Also make dispersion.conf and swift-dispersion-report manpages
current.
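For example, to report only on containers or only on objects (flag
names assumed to be --container-only and --object-only):
$ swift-dispersion-report --container-only
$ swift-dispersion-report --object-only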
DocImpact
Change-Id: Iad56133cad261241db27d0e2103098e3c2f3c245
It's not sufficient to just look at swift.object-updater.successes to
see the async_pending unlink rate. There are two different spots where
unlinks happen: one when an async_pending has been successfully
processed, and another when the updater notices multiple
async_pendings for the same object. Both events are now tracked under
the same name: swift.object-updater.unlinks.
FakeLogger has now sprouted a couple of convenience methods for
testing logged metrics.
Fixed pep8 1.3.3's complaints in the files this diff touches.
Also: bonus spelling and grammar fixes in the admin guide.
Change-Id: I8c1493784adbe24ba2b5512615e87669b3d94505
Removed many StatsD logging calls in proxy-server and added
swift-informant-style catch-all logging in the proxy-logger middleware.
Many errors previously rolled into the "proxy-server.<type>.errors"
counter will now appear broken down by response code and with timing
data at: "proxy-server.<type>.<verb>.<status>.timing". Also, bytes
transferred (sum of in + out) will be at:
"proxy-server.<type>.<verb>.<status>.xfer". The proxy-logging
middleware can get its StatsD config from standard vars in [DEFAULT] or
from access_log_statsd_* config vars in its config section.
Similarly to Swift Informant, request methods ("verbs") are filtered
using the new proxy-logging config var, "log_statsd_valid_http_methods"
which defaults to GET, HEAD, POST, PUT, DELETE, and COPY. Requests with
methods not in this list use "BAD_METHOD" for <verb> in the metric name.
To avoid user error, access_log_statsd_valid_http_methods is also
accepted.
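A hedged proxy-server.conf sketch for the middleware's StatsD config
(values illustrative):
[filter:proxy-logging]
use = egg:swift#proxy_logging
access_log_statsd_host = localhost
access_log_statsd_port = 8125
access_log_statsd_valid_http_methods = GET,HEAD,PUT,DELETE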
Previously, proxy-server metrics used "Account", "Container", and
"Object" for the <type>, but these are now all lowercase.
Updated the admin guide's StatsD docs to reflect the above changes and
also include the "proxy-server.<type>.handoff_count" and
"proxy-server.<type>.handoff_all_count" metrics.
The proxy server now saves off the original req.method and proxy_logging
will use this if it can (both for request logging and as the "<verb>" in
the statsd timing metric). This fixes bug 1025433.
Removed some stale access_log_* related code in proxy/server.py. Also
removed the BaseApplication/Application distinction as it's no longer
necessary.
Fixed up the sample config files a bit (logging lines, mostly).
Fixed typo in SAIO development guide.
Got proxy_logging.py test coverage to 100%.
Fixed proxy_logging.py for PEP8 v1.3.2.
Enhanced test.unit.FakeLogger to track more calls to enable testing
StatsD metric calls.
Change-Id: I45d94cb76450be96d66fcfab56359bdfdc3a2576
To tell when replication for a device has finished, it's important to
know when the replicator is removing objects. This was previously
handled for the object-replicator
(object-replicator.partition.delete.count.<device> and
object-replicator.partition.update.count.<device> metrics) but not the
account and container replicators.
This patch extends the existing DB removal count metrics to make them
per-device. The new metrics are:
account-replicator.removes.<device>
container-replicator.removes.<device>
There's also a bonus refactoring and increased test coverage of the DB
replicator code.
Change-Id: I2067317d4a5f8ad2a496834147954bdcdfc541c1
The Admin Guide now contains information about the ring serialization
change (and importantly, how to downgrade, if necessary).
Also added container-server conf var, "allow_versions" to the Deployment
Guide.
Also changed description of proxy-server conf var,
"max_containers_whitelist" to say it contains "account names" not
"account hashes".
Change-Id: Ib23c6118cc5195cc04765afd28e442e4c735f0d4
We're still using saio:11000 in a few spots, so a few things
don't work out of the box on the SAIO. Fixes bug #1024561.
Change-Id: I226de54c2785b0d0b681c8d0cc24260adbd3d663
Expand recon middleware to include support for account and container
servers in addition to the existing object servers. Also add support
for retrieving recent information from auditors, replicators, and
updaters. In the case of certain checks (such as container auditors)
the stats returned are only for the most recent path processed.
The middleware has also been refactored and should now also handle
errors better in cases where stats are unavailable.
While new checks have been added, the output from pre-existing
checks has not changed. This should allow existing 3rd party
utilities such as the Swift ZenPack to continue to function.
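Assuming the new checks follow the /recon/<check>/<server-type> pattern
shown earlier for object replication, queries would look like:
$ curl http://<ip>:<port>/recon/replication/account
$ curl http://<ip>:<port>/recon/auditor/container
$ curl http://<ip>:<port>/recon/updater/object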
Change-Id: Ib9893a77b9b8a2f03179f2a73639bc4a6e264df7
Documentation, including a list of metrics reported and their semantics,
is in the Admin Guide in a new section, "Reporting Metrics to StatsD".
An optional "metric prefix" may be configured which will be prepended to
every metric name sent to StatsD.
Here is the rationale for doing a deep integration like this versus
only sending metrics to StatsD from middleware: it's the only way to
report some internal activities of Swift in a real-time manner, and it
gives one way of reporting to StatsD and one place/style of
configuration. Even things (like, say, timing of PUT requests into the
proxy-server) which could be logged via middleware are thus
consistently logged the same way (deep integration via the logger
delegate methods).
When log_statsd_host is configured, get_logger() injects a
swift.common.utils.StatsdClient object into the logger as
logger.statsd_client. Then a set of delegate methods on LogAdapter
either pass through to the StatsdClient object or become no-ops. This
allows StatsD logging to look like:
self.logger.increment('some.metric.here')
and do the right thing in all cases and with no messy conditional logic.
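A minimal Python sketch of this pattern (conf values illustrative):
import time
from swift.common.utils import get_logger

conf = {'log_statsd_host': 'localhost',
        'log_statsd_port': '8125',
        'log_statsd_metric_prefix': 'saio'}
logger = get_logger(conf, log_route='demo')
# These delegate to logger.statsd_client when log_statsd_host is set;
# they are harmless no-ops otherwise.
logger.increment('some.metric.here')
logger.timing_since('some.timer', time.time())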
I wanted to use the pystatsd module for the StatsD client, but the
version on PyPi is lagging the git repo (and is missing both the prefix
functionality and the timing_since() method). So I wrote my own:
swift.common.utils.StatsdClient. The interface is the same as
pystatsd.Client, but the code was written from scratch. It's pretty
simple, and the tests I added cover it. This also frees Swift from an
optional dependency on the pystatsd module, making this feature easier
to enable.
There's test coverage for the new code and all existing tests continue
to pass.
Refactored out _one_audit_pass() method in swift/account/auditor.py and
swift/container/auditor.py.
Fixed some misc. PEP8 violations.
Misc test cleanups and refactorings (particularly the way "fake logging"
is handled).
Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
Corrected its/it's mistakes, harmonized line wrapping within some docs
and clarified doc wording in several places.
Change-Id: Ib9ac6d5e859f770a702e1fad6de8d4abe0390b47
Adds the configuration file option "dump_json" and command line
options [-j|--dump-json] to have swift-dispersion-report output
the report in JSON format. This allows the dispersion report to
be more easily consumed elsewhere.
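For example:
$ swift-dispersion-report -j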
There are also a few pep8 fixes and removal of unused imports.
Change-Id: I2374311ccbef43e6bbae24665c9584e60f3da173