diff --git a/doc/source/admin_guide.rst b/doc/source/admin_guide.rst index d8175e25ce..c87d7edffa 100644 --- a/doc/source/admin_guide.rst +++ b/doc/source/admin_guide.rst @@ -883,450 +883,28 @@ of async_pendings in real-time, but will not tell you the current number of async_pending container updates on disk at any point in time. Note also that the set of metrics collected, their names, and their semantics -are not locked down and will change over time. +are not locked down and will change over time. For more details, see the +service-specific tables listed below: -Metrics for `account-auditor`: - -========================== ========================================================= -Metric Name Description --------------------------- --------------------------------------------------------- -`account-auditor.errors` Count of audit runs (across all account databases) which - caught an Exception. -`account-auditor.passes` Count of individual account databases which passed audit. -`account-auditor.failures` Count of individual account databases which failed audit. -`account-auditor.timing` Timing data for individual account database audits. -========================== ========================================================= - -Metrics for `account-reaper`: - -============================================== ==================================================== -Metric Name Description ----------------------------------------------- ---------------------------------------------------- -`account-reaper.errors` Count of devices failing the mount check. -`account-reaper.timing` Timing data for each reap_account() call. -`account-reaper.return_codes.X` Count of HTTP return codes from various operations - (e.g. object listing, container deletion, etc.). The - value for X is the first digit of the return code - (2 for 201, 4 for 404, etc.). -`account-reaper.containers_failures` Count of failures to delete a container. 
-`account-reaper.containers_deleted` Count of containers successfully deleted. -`account-reaper.containers_remaining` Count of containers which failed to delete with - zero successes. -`account-reaper.containers_possibly_remaining` Count of containers which failed to delete with - at least one success. -`account-reaper.objects_failures` Count of failures to delete an object. -`account-reaper.objects_deleted` Count of objects successfully deleted. -`account-reaper.objects_remaining` Count of objects which failed to delete with zero - successes. -`account-reaper.objects_possibly_remaining` Count of objects which failed to delete with at - least one success. -============================================== ==================================================== - -Metrics for `account-server` ("Not Found" is not considered an error and requests -which increment `errors` are not included in the timing data): - -======================================== ======================================================= -Metric Name Description ----------------------------------------- ------------------------------------------------------- -`account-server.DELETE.errors.timing` Timing data for each DELETE request resulting in an - error: bad request, not mounted, missing timestamp. -`account-server.DELETE.timing` Timing data for each DELETE request not resulting in - an error. -`account-server.PUT.errors.timing` Timing data for each PUT request resulting in an error: - bad request, not mounted, conflict, recently-deleted. -`account-server.PUT.timing` Timing data for each PUT request not resulting in an - error. -`account-server.HEAD.errors.timing` Timing data for each HEAD request resulting in an - error: bad request, not mounted. -`account-server.HEAD.timing` Timing data for each HEAD request not resulting in - an error. 
-`account-server.GET.errors.timing` Timing data for each GET request resulting in an - error: bad request, not mounted, bad delimiter, - account listing limit too high, bad accept header. -`account-server.GET.timing` Timing data for each GET request not resulting in - an error. -`account-server.REPLICATE.errors.timing` Timing data for each REPLICATE request resulting in an - error: bad request, not mounted. -`account-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting - in an error. -`account-server.POST.errors.timing` Timing data for each POST request resulting in an - error: bad request, bad or missing timestamp, not - mounted. -`account-server.POST.timing` Timing data for each POST request not resulting in - an error. -======================================== ======================================================= - -Metrics for `account-replicator`: - -===================================== ==================================================== -Metric Name Description -------------------------------------- ---------------------------------------------------- -`account-replicator.diffs` Count of syncs handled by sending differing rows. -`account-replicator.diff_caps` Count of "diffs" operations which failed because - "max_diffs" was hit. -`account-replicator.no_changes` Count of accounts found to be in sync. -`account-replicator.hashmatches` Count of accounts found to be in sync via hash - comparison (`broker.merge_syncs` was called). -`account-replicator.rsyncs` Count of completely missing accounts which were sent - via rsync. -`account-replicator.remote_merges` Count of syncs handled by sending entire database - via rsync. -`account-replicator.attempts` Count of database replication attempts. -`account-replicator.failures` Count of database replication attempts which failed - due to corruption (quarantined) or inability to read - as well as attempts to individual nodes which - failed. 
-`account-replicator.removes.` Count of databases on deleted because the - delete_timestamp was greater than the put_timestamp - and the database had no rows or because it was - successfully sync'ed to other locations and doesn't - belong here anymore. -`account-replicator.successes` Count of replication attempts to an individual node - which were successful. -`account-replicator.timing` Timing data for each database replication attempt - not resulting in a failure. -===================================== ==================================================== - -Metrics for `container-auditor`: - -============================ ==================================================== -Metric Name Description ----------------------------- ---------------------------------------------------- -`container-auditor.errors` Incremented when an Exception is caught in an audit - pass (only once per pass, max). -`container-auditor.passes` Count of individual containers passing an audit. -`container-auditor.failures` Count of individual containers failing an audit. -`container-auditor.timing` Timing data for each container audit. -============================ ==================================================== - -Metrics for `container-replicator`: - -======================================= ==================================================== -Metric Name Description ---------------------------------------- ---------------------------------------------------- -`container-replicator.diffs` Count of syncs handled by sending differing rows. -`container-replicator.diff_caps` Count of "diffs" operations which failed because - "max_diffs" was hit. -`container-replicator.no_changes` Count of containers found to be in sync. -`container-replicator.hashmatches` Count of containers found to be in sync via hash - comparison (`broker.merge_syncs` was called). -`container-replicator.rsyncs` Count of completely missing containers where were sent - via rsync. 
-`container-replicator.remote_merges` Count of syncs handled by sending entire database - via rsync. -`container-replicator.attempts` Count of database replication attempts. -`container-replicator.failures` Count of database replication attempts which failed - due to corruption (quarantined) or inability to read - as well as attempts to individual nodes which - failed. -`container-replicator.removes.` Count of databases deleted on because the - delete_timestamp was greater than the put_timestamp - and the database had no rows or because it was - successfully sync'ed to other locations and doesn't - belong here anymore. -`container-replicator.successes` Count of replication attempts to an individual node - which were successful. -`container-replicator.timing` Timing data for each database replication attempt - not resulting in a failure. -======================================= ==================================================== - -Metrics for `container-server` ("Not Found" is not considered an error and requests -which increment `errors` are not included in the timing data): - -========================================== ==================================================== -Metric Name Description ------------------------------------------- ---------------------------------------------------- -`container-server.DELETE.errors.timing` Timing data for DELETE request errors: bad request, - not mounted, missing timestamp, conflict. -`container-server.DELETE.timing` Timing data for each DELETE request not resulting in - an error. -`container-server.PUT.errors.timing` Timing data for PUT request errors: bad request, - missing timestamp, not mounted, conflict. -`container-server.PUT.timing` Timing data for each PUT request not resulting in an - error. -`container-server.HEAD.errors.timing` Timing data for HEAD request errors: bad request, - not mounted. -`container-server.HEAD.timing` Timing data for each HEAD request not resulting in - an error. 
-`container-server.GET.errors.timing` Timing data for GET request errors: bad request, - not mounted, parameters not utf8, bad accept header. -`container-server.GET.timing` Timing data for each GET request not resulting in - an error. -`container-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad - request, not mounted. -`container-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting - in an error. -`container-server.POST.errors.timing` Timing data for POST request errors: bad request, - bad x-container-sync-to, not mounted. -`container-server.POST.timing` Timing data for each POST request not resulting in - an error. -========================================== ==================================================== - -Metrics for `container-sync`: - -=============================== ==================================================== -Metric Name Description -------------------------------- ---------------------------------------------------- -`container-sync.skips` Count of containers skipped because they don't have - sync'ing enabled. -`container-sync.failures` Count of failures sync'ing of individual containers. -`container-sync.syncs` Count of individual containers sync'ed successfully. -`container-sync.deletes` Count of container database rows sync'ed by - deletion. -`container-sync.deletes.timing` Timing data for each container database row - synchronization via deletion. -`container-sync.puts` Count of container database rows sync'ed by Putting. -`container-sync.puts.timing` Timing data for each container database row - synchronization via Putting. 
-=============================== ==================================================== - -Metrics for `container-updater`: - -============================== ==================================================== -Metric Name Description ------------------------------- ---------------------------------------------------- -`container-updater.successes` Count of containers which successfully updated their - account. -`container-updater.failures` Count of containers which failed to update their - account. -`container-updater.no_changes` Count of containers which didn't need to update - their account. -`container-updater.timing` Timing data for processing a container; only - includes timing for containers which needed to - update their accounts (i.e. "successes" and - "failures" but not "no_changes"). -============================== ==================================================== - -Metrics for `object-auditor`: - -============================ ==================================================== -Metric Name Description ----------------------------- ---------------------------------------------------- -`object-auditor.quarantines` Count of objects failing audit and quarantined. -`object-auditor.errors` Count of errors encountered while auditing objects. -`object-auditor.timing` Timing data for each object audit (does not include - any rate-limiting sleep time for - max_files_per_second, but does include rate-limiting - sleep time for max_bytes_per_second). -============================ ==================================================== - -Metrics for `object-expirer`: - -======================== ==================================================== -Metric Name Description ------------------------- ---------------------------------------------------- -`object-expirer.objects` Count of objects expired. -`object-expirer.errors` Count of errors encountered while attempting to - expire an object. 
-`object-expirer.timing` Timing data for each object expiration attempt, - including ones resulting in an error. -======================== ==================================================== - -Metrics for `object-reconstructor`: - -====================================================== ====================================================== -Metric Name Description ------------------------------------------------------- ------------------------------------------------------ -`object-reconstructor.partition.delete.count.` A count of partitions on which were - reconstructed and synced to another node because they - didn't belong on this node. This metric is tracked - per-device to allow for "quiescence detection" for - object reconstruction activity on each device. -`object-reconstructor.partition.delete.timing` Timing data for partitions reconstructed and synced to - another node because they didn't belong on this node. - This metric is not tracked per device. -`object-reconstructor.partition.update.count.` A count of partitions on which were - reconstructed and synced to another node, but also - belong on this node. As with delete.count, this metric - is tracked per-device. -`object-reconstructor.partition.update.timing` Timing data for partitions reconstructed which also - belong on this node. This metric is not tracked - per-device. -`object-reconstructor.suffix.hashes` Count of suffix directories whose hash (of filenames) - was recalculated. -`object-reconstructor.suffix.syncs` Count of suffix directories reconstructed with ssync. 
-====================================================== ====================================================== - -Metrics for `object-replicator`: - -=================================================== ==================================================== -Metric Name Description ---------------------------------------------------- ---------------------------------------------------- -`object-replicator.partition.delete.count.` A count of partitions on which were - replicated to another node because they didn't - belong on this node. This metric is tracked - per-device to allow for "quiescence detection" for - object replication activity on each device. -`object-replicator.partition.delete.timing` Timing data for partitions replicated to another - node because they didn't belong on this node. This - metric is not tracked per device. -`object-replicator.partition.update.count.` A count of partitions on which were - replicated to another node, but also belong on this - node. As with delete.count, this metric is tracked - per-device. -`object-replicator.partition.update.timing` Timing data for partitions replicated which also - belong on this node. This metric is not tracked - per-device. -`object-replicator.suffix.hashes` Count of suffix directories whose hash (of filenames) - was recalculated. -`object-replicator.suffix.syncs` Count of suffix directories replicated with rsync. -=================================================== ==================================================== - -Metrics for `object-server`: - -======================================= ==================================================== -Metric Name Description ---------------------------------------- ---------------------------------------------------- -`object-server.quarantines` Count of objects (files) found bad and moved to - quarantine. -`object-server.async_pendings` Count of container updates saved as async_pendings - (may result from PUT or DELETE requests). 
-`object-server.POST.errors.timing` Timing data for POST request errors: bad request, - missing timestamp, delete-at in past, not mounted. -`object-server.POST.timing` Timing data for each POST request not resulting in - an error. -`object-server.PUT.errors.timing` Timing data for PUT request errors: bad request, - not mounted, missing timestamp, object creation - constraint violation, delete-at in past. -`object-server.PUT.timeouts` Count of object PUTs which exceeded max_upload_time. -`object-server.PUT.timing` Timing data for each PUT request not resulting in an - error. -`object-server.PUT..timing` Timing data per kB transferred (ms/kB) for each - non-zero-byte PUT request on each device. - Monitoring problematic devices, higher is bad. -`object-server.GET.errors.timing` Timing data for GET request errors: bad request, - not mounted, header timestamps before the epoch, - precondition failed. - File errors resulting in a quarantine are not - counted here. -`object-server.GET.timing` Timing data for each GET request not resulting in an - error. Includes requests which couldn't find the - object (including disk errors resulting in file - quarantine). -`object-server.HEAD.errors.timing` Timing data for HEAD request errors: bad request, - not mounted. -`object-server.HEAD.timing` Timing data for each HEAD request not resulting in - an error. Includes requests which couldn't find the - object (including disk errors resulting in file - quarantine). -`object-server.DELETE.errors.timing` Timing data for DELETE request errors: bad request, - missing timestamp, not mounted, precondition - failed. Includes requests which couldn't find or - match the object. -`object-server.DELETE.timing` Timing data for each DELETE request not resulting - in an error. -`object-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad - request, not mounted. -`object-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting - in an error. 
-======================================= ==================================================== - -Metrics for `object-updater`: - -============================ ==================================================== -Metric Name Description ----------------------------- ---------------------------------------------------- -`object-updater.errors` Count of drives not mounted or async_pending files - with an unexpected name. -`object-updater.timing` Timing data for object sweeps to flush async_pending - container updates. Does not include object sweeps - which did not find an existing async_pending storage - directory. -`object-updater.quarantines` Count of async_pending container updates which were - corrupted and moved to quarantine. -`object-updater.successes` Count of successful container updates. -`object-updater.failures` Count of failed container updates. -`object-updater.unlinks` Count of async_pending files unlinked. An - async_pending file is unlinked either when it is - successfully processed or when the replicator sees - that there is a newer async_pending file for the - same object. -============================ ==================================================== - -Metrics for `proxy-server` (in the table, `` is the proxy-server -controller responsible for the request and will be one of "account", -"container", or "object"): - -======================================== ==================================================== -Metric Name Description ----------------------------------------- ---------------------------------------------------- -`proxy-server.errors` Count of errors encountered while serving requests - before the controller type is determined. Includes - invalid Content-Length, errors finding the internal - controller to handle the request, invalid utf8, and - bad URLs. -`proxy-server..handoff_count` Count of node hand-offs; only tracked if log_handoffs - is set in the proxy-server config. 
-`proxy-server..handoff_all_count` Count of times *only* hand-off locations were - utilized; only tracked if log_handoffs is set in the - proxy-server config. -`proxy-server..client_timeouts` Count of client timeouts (client did not read within - `client_timeout` seconds during a GET or did not - supply data within `client_timeout` seconds during - a PUT). -`proxy-server..client_disconnects` Count of detected client disconnects during PUT - operations (does NOT include caught Exceptions in - the proxy-server which caused a client disconnect). -======================================== ==================================================== - -Metrics for `proxy-logging` middleware (in the table, `` is either the -proxy-server controller responsible for the request: "account", "container", -"object", or the string "SOS" if the request came from the `Swift Origin Server`_ -middleware. The `` portion will be one of "GET", "HEAD", "POST", "PUT", -"DELETE", "COPY", "OPTIONS", or "BAD_METHOD". The list of valid HTTP methods -is configurable via the `log_statsd_valid_http_methods` config variable and -the default setting yields the above behavior): - -.. _Swift Origin Server: https://github.com/dpgoetz/sos - -==================================================== ============================================ -Metric Name Description ----------------------------------------------------- -------------------------------------------- -`proxy-server....timing` Timing data for requests, start to finish. - The portion is the numeric HTTP - status code for the request (e.g. "200" or - "404"). -`proxy-server..GET..first-byte.timing` Timing data up to completion of sending the - response headers (only for GET requests). - and are as for the main - timing metric. -`proxy-server....xfer` This counter metric is the sum of bytes - transferred in (from clients) and out (to - clients) for requests. The , , - and portions of the metric are just - like the main timing metric. 
-==================================================== ============================================ - -The `proxy-logging` middleware also groups these metrics by policy. The -`` portion represents a policy index): - -========================================================================== ===================================== -Metric Name Description --------------------------------------------------------------------------- ------------------------------------- -`proxy-server.object.policy....timing` Timing data for requests, aggregated - by policy index. -`proxy-server.object.policy..GET..first-byte.timing` Timing data up to completion of - sending the response headers, - aggregated by policy index. -`proxy-server.object.policy....xfer` Sum of bytes transferred in and out, - aggregated by policy index. -========================================================================== ===================================== - -Metrics for `tempauth` middleware (in the table, `` represents -the actual configured reseller_prefix or "`NONE`" if the reseller_prefix is the -empty string): - -========================================= ==================================================== -Metric Name Description ------------------------------------------ ---------------------------------------------------- -`tempauth..unauthorized` Count of regular requests which were denied with - HTTPUnauthorized. -`tempauth..forbidden` Count of regular requests which were denied with - HTTPForbidden. -`tempauth..token_denied` Count of token requests which were denied. -`tempauth..errors` Count of errors. -========================================= ==================================================== +.. 
toctree:: + metrics/account_auditor + metrics/account_reaper + metrics/account_server + metrics/account_replicator + metrics/container_auditor + metrics/container_replicator + metrics/container_server + metrics/container_sync + metrics/container_updater + metrics/object_auditor + metrics/object_expirer + metrics/object_reconstructor + metrics/object_replicator + metrics/object_server + metrics/object_updater + metrics/proxy_server +Or, view :doc:`metrics/all` as one page. ------------------------ Debugging Tips and Tools diff --git a/doc/source/metrics/account_auditor.rst b/doc/source/metrics/account_auditor.rst new file mode 100644 index 0000000000..896908a517 --- /dev/null +++ b/doc/source/metrics/account_auditor.rst @@ -0,0 +1,12 @@ +``account-auditor`` Metrics +=========================== + +========================== ========================================================= +Metric Name Description +-------------------------- --------------------------------------------------------- +`account-auditor.errors` Count of audit runs (across all account databases) which + caught an Exception. +`account-auditor.passes` Count of individual account databases which passed audit. +`account-auditor.failures` Count of individual account databases which failed audit. +`account-auditor.timing` Timing data for individual account database audits. 
+========================== ========================================================= diff --git a/doc/source/metrics/account_reaper.rst b/doc/source/metrics/account_reaper.rst new file mode 100644 index 0000000000..f73b0db218 --- /dev/null +++ b/doc/source/metrics/account_reaper.rst @@ -0,0 +1,25 @@ +``account-reaper`` Metrics +========================== + +============================================== ==================================================== +Metric Name Description +---------------------------------------------- ---------------------------------------------------- +`account-reaper.errors` Count of devices failing the mount check. +`account-reaper.timing` Timing data for each reap_account() call. +`account-reaper.return_codes.X` Count of HTTP return codes from various operations + (e.g. object listing, container deletion, etc.). The + value for X is the first digit of the return code + (2 for 201, 4 for 404, etc.). +`account-reaper.containers_failures` Count of failures to delete a container. +`account-reaper.containers_deleted` Count of containers successfully deleted. +`account-reaper.containers_remaining` Count of containers which failed to delete with + zero successes. +`account-reaper.containers_possibly_remaining` Count of containers which failed to delete with + at least one success. +`account-reaper.objects_failures` Count of failures to delete an object. +`account-reaper.objects_deleted` Count of objects successfully deleted. +`account-reaper.objects_remaining` Count of objects which failed to delete with zero + successes. +`account-reaper.objects_possibly_remaining` Count of objects which failed to delete with at + least one success. 
+============================================== ==================================================== diff --git a/doc/source/metrics/account_replicator.rst b/doc/source/metrics/account_replicator.rst new file mode 100644 index 0000000000..dd1204caf7 --- /dev/null +++ b/doc/source/metrics/account_replicator.rst @@ -0,0 +1,31 @@ +``account-replicator`` Metrics +============================== + +===================================== ==================================================== +Metric Name Description +------------------------------------- ---------------------------------------------------- +`account-replicator.diffs` Count of syncs handled by sending differing rows. +`account-replicator.diff_caps` Count of "diffs" operations which failed because + "max_diffs" was hit. +`account-replicator.no_changes` Count of accounts found to be in sync. +`account-replicator.hashmatches` Count of accounts found to be in sync via hash + comparison (`broker.merge_syncs` was called). +`account-replicator.rsyncs` Count of completely missing accounts which were sent + via rsync. +`account-replicator.remote_merges` Count of syncs handled by sending entire database + via rsync. +`account-replicator.attempts` Count of database replication attempts. +`account-replicator.failures` Count of database replication attempts which failed + due to corruption (quarantined) or inability to read + as well as attempts to individual nodes which + failed. +`account-replicator.removes.` Count of databases on deleted because the + delete_timestamp was greater than the put_timestamp + and the database had no rows or because it was + successfully sync'ed to other locations and doesn't + belong here anymore. +`account-replicator.successes` Count of replication attempts to an individual node + which were successful. +`account-replicator.timing` Timing data for each database replication attempt + not resulting in a failure. 
+===================================== ==================================================== diff --git a/doc/source/metrics/account_server.rst b/doc/source/metrics/account_server.rst new file mode 100644 index 0000000000..66110fd99c --- /dev/null +++ b/doc/source/metrics/account_server.rst @@ -0,0 +1,37 @@ +``account-server`` Metrics +========================== + +.. note:: + "Not Found" is not considered an error and requests + which increment `errors` are not included in the timing data. + +======================================== ======================================================= +Metric Name Description +---------------------------------------- ------------------------------------------------------- +`account-server.DELETE.errors.timing` Timing data for each DELETE request resulting in an + error: bad request, not mounted, missing timestamp. +`account-server.DELETE.timing` Timing data for each DELETE request not resulting in + an error. +`account-server.PUT.errors.timing` Timing data for each PUT request resulting in an error: + bad request, not mounted, conflict, recently-deleted. +`account-server.PUT.timing` Timing data for each PUT request not resulting in an + error. +`account-server.HEAD.errors.timing` Timing data for each HEAD request resulting in an + error: bad request, not mounted. +`account-server.HEAD.timing` Timing data for each HEAD request not resulting in + an error. +`account-server.GET.errors.timing` Timing data for each GET request resulting in an + error: bad request, not mounted, bad delimiter, + account listing limit too high, bad accept header. +`account-server.GET.timing` Timing data for each GET request not resulting in + an error. +`account-server.REPLICATE.errors.timing` Timing data for each REPLICATE request resulting in an + error: bad request, not mounted. +`account-server.REPLICATE.timing` Timing data for each REPLICATE request not resulting + in an error.
+`account-server.POST.errors.timing` Timing data for each POST request resulting in an + error: bad request, bad or missing timestamp, not + mounted. +`account-server.POST.timing` Timing data for each POST request not resulting in + an error. +======================================== ======================================================= diff --git a/doc/source/metrics/all.rst b/doc/source/metrics/all.rst new file mode 100644 index 0000000000..bca1c10870 --- /dev/null +++ b/doc/source/metrics/all.rst @@ -0,0 +1,24 @@ +:orphan: + +All Statsd Metrics +================== + +.. include:: account_auditor.rst +.. include:: account_reaper.rst +.. include:: account_server.rst +.. include:: account_replicator.rst + +.. include:: container_auditor.rst +.. include:: container_replicator.rst +.. include:: container_server.rst +.. include:: container_sync.rst +.. include:: container_updater.rst + +.. include:: object_auditor.rst +.. include:: object_expirer.rst +.. include:: object_reconstructor.rst +.. include:: object_replicator.rst +.. include:: object_server.rst +.. include:: object_updater.rst + +.. include:: proxy_server.rst diff --git a/doc/source/metrics/container_auditor.rst b/doc/source/metrics/container_auditor.rst new file mode 100644 index 0000000000..9c1043c082 --- /dev/null +++ b/doc/source/metrics/container_auditor.rst @@ -0,0 +1,12 @@ +``container-auditor`` Metrics +============================= + +============================ ==================================================== +Metric Name Description +---------------------------- ---------------------------------------------------- +`container-auditor.errors` Incremented when an Exception is caught in an audit + pass (only once per pass, max). +`container-auditor.passes` Count of individual containers passing an audit. +`container-auditor.failures` Count of individual containers failing an audit. +`container-auditor.timing` Timing data for each container audit. 
+============================ ====================================================
diff --git a/doc/source/metrics/container_replicator.rst b/doc/source/metrics/container_replicator.rst
new file mode 100644
index 0000000000..2f9463be68
--- /dev/null
+++ b/doc/source/metrics/container_replicator.rst
@@ -0,0 +1,31 @@
+``container-replicator`` Metrics
+================================
+
+======================================= ====================================================
+Metric Name                             Description
+--------------------------------------- ----------------------------------------------------
+`container-replicator.diffs`            Count of syncs handled by sending differing rows.
+`container-replicator.diff_caps`        Count of "diffs" operations which failed because
+                                        "max_diffs" was hit.
+`container-replicator.no_changes`       Count of containers found to be in sync.
+`container-replicator.hashmatches`      Count of containers found to be in sync via hash
+                                        comparison (`broker.merge_syncs` was called).
+`container-replicator.rsyncs`           Count of completely missing containers which were sent
+                                        via rsync.
+`container-replicator.remote_merges`    Count of syncs handled by sending entire database
+                                        via rsync.
+`container-replicator.attempts`         Count of database replication attempts.
+`container-replicator.failures`         Count of database replication attempts which failed
+                                        due to corruption (quarantined) or inability to read
+                                        as well as attempts to individual nodes which
+                                        failed.
+`container-replicator.removes.<device>` Count of databases deleted on <device> because the
+                                        delete_timestamp was greater than the put_timestamp
+                                        and the database had no rows or because it was
+                                        successfully sync'ed to other locations and doesn't
+                                        belong here anymore.
+`container-replicator.successes`        Count of replication attempts to an individual node
+                                        which were successful.
+`container-replicator.timing`           Timing data for each database replication attempt
+                                        not resulting in a failure.
+======================================= ====================================================
diff --git a/doc/source/metrics/container_server.rst b/doc/source/metrics/container_server.rst
new file mode 100644
index 0000000000..95a94509ea
--- /dev/null
+++ b/doc/source/metrics/container_server.rst
@@ -0,0 +1,35 @@
+``container-server`` Metrics
+============================
+
+.. note::
+    "Not Found" is not considered an error and requests
+    which increment `errors` are not included in the timing data.
+
+========================================== ====================================================
+Metric Name                                Description
+------------------------------------------ ----------------------------------------------------
+`container-server.DELETE.errors.timing`    Timing data for DELETE request errors: bad request,
+                                           not mounted, missing timestamp, conflict.
+`container-server.DELETE.timing`           Timing data for each DELETE request not resulting in
+                                           an error.
+`container-server.PUT.errors.timing`       Timing data for PUT request errors: bad request,
+                                           missing timestamp, not mounted, conflict.
+`container-server.PUT.timing`              Timing data for each PUT request not resulting in an
+                                           error.
+`container-server.HEAD.errors.timing`      Timing data for HEAD request errors: bad request,
+                                           not mounted.
+`container-server.HEAD.timing`             Timing data for each HEAD request not resulting in
+                                           an error.
+`container-server.GET.errors.timing`       Timing data for GET request errors: bad request,
+                                           not mounted, parameters not utf8, bad accept header.
+`container-server.GET.timing`              Timing data for each GET request not resulting in
+                                           an error.
+`container-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
+                                           request, not mounted.
+`container-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
+                                           in an error.
+`container-server.POST.errors.timing`      Timing data for POST request errors: bad request,
+                                           bad x-container-sync-to, not mounted.
+`container-server.POST.timing`             Timing data for each POST request not resulting in
+                                           an error.
+========================================== ====================================================
diff --git a/doc/source/metrics/container_sync.rst b/doc/source/metrics/container_sync.rst
new file mode 100644
index 0000000000..40a291ea73
--- /dev/null
+++ b/doc/source/metrics/container_sync.rst
@@ -0,0 +1,18 @@
+``container-sync`` Metrics
+==========================
+
+=============================== ====================================================
+Metric Name                     Description
+------------------------------- ----------------------------------------------------
+`container-sync.skips`          Count of containers skipped because they don't have
+                                sync'ing enabled.
+`container-sync.failures`       Count of failures while sync'ing individual containers.
+`container-sync.syncs`          Count of individual containers sync'ed successfully.
+`container-sync.deletes`        Count of container database rows sync'ed by
+                                deletion.
+`container-sync.deletes.timing` Timing data for each container database row
+                                synchronization via deletion.
+`container-sync.puts`           Count of container database rows sync'ed by Putting.
+`container-sync.puts.timing`    Timing data for each container database row
+                                synchronization via Putting.
+=============================== ====================================================
diff --git a/doc/source/metrics/container_updater.rst b/doc/source/metrics/container_updater.rst
new file mode 100644
index 0000000000..b1ce46ef9f
--- /dev/null
+++ b/doc/source/metrics/container_updater.rst
@@ -0,0 +1,17 @@
+``container-updater`` Metrics
+=============================
+
+============================== ====================================================
+Metric Name                    Description
+------------------------------ ----------------------------------------------------
+`container-updater.successes`  Count of containers which successfully updated their
+                               account.
+`container-updater.failures`   Count of containers which failed to update their
+                               account.
+`container-updater.no_changes` Count of containers which didn't need to update
+                               their account.
+`container-updater.timing`     Timing data for processing a container; only
+                               includes timing for containers which needed to
+                               update their accounts (i.e. "successes" and
+                               "failures" but not "no_changes").
+============================== ====================================================
diff --git a/doc/source/metrics/object_auditor.rst b/doc/source/metrics/object_auditor.rst
new file mode 100644
index 0000000000..ea0751d727
--- /dev/null
+++ b/doc/source/metrics/object_auditor.rst
@@ -0,0 +1,13 @@
+``object-auditor`` Metrics
+==========================
+
+============================ ====================================================
+Metric Name                  Description
+---------------------------- ----------------------------------------------------
+`object-auditor.quarantines` Count of objects failing audit and quarantined.
+`object-auditor.errors`      Count of errors encountered while auditing objects.
+`object-auditor.timing`      Timing data for each object audit (does not include
+                             any rate-limiting sleep time for
+                             max_files_per_second, but does include rate-limiting
+                             sleep time for max_bytes_per_second).
+============================ ====================================================
diff --git a/doc/source/metrics/object_expirer.rst b/doc/source/metrics/object_expirer.rst
new file mode 100644
index 0000000000..3026ec9165
--- /dev/null
+++ b/doc/source/metrics/object_expirer.rst
@@ -0,0 +1,12 @@
+``object-expirer`` Metrics
+==========================
+
+======================== ====================================================
+Metric Name              Description
+------------------------ ----------------------------------------------------
+`object-expirer.objects` Count of objects expired.
+`object-expirer.errors`  Count of errors encountered while attempting to
+                         expire an object.
+`object-expirer.timing`  Timing data for each object expiration attempt,
+                         including ones resulting in an error.
+======================== ====================================================
diff --git a/doc/source/metrics/object_reconstructor.rst b/doc/source/metrics/object_reconstructor.rst
new file mode 100644
index 0000000000..e726f74cb3
--- /dev/null
+++ b/doc/source/metrics/object_reconstructor.rst
@@ -0,0 +1,25 @@
+``object-reconstructor`` Metrics
+================================
+
+====================================================== ======================================================
+Metric Name                                            Description
+------------------------------------------------------ ------------------------------------------------------
+`object-reconstructor.partition.delete.count.<device>` A count of partitions on <device> which were
+                                                       reconstructed and synced to another node because they
+                                                       didn't belong on this node. This metric is tracked
+                                                       per-device to allow for "quiescence detection" for
+                                                       object reconstruction activity on each device.
+`object-reconstructor.partition.delete.timing`         Timing data for partitions reconstructed and synced to
+                                                       another node because they didn't belong on this node.
+                                                       This metric is not tracked per device.
+`object-reconstructor.partition.update.count.<device>` A count of partitions on <device> which were
+                                                       reconstructed and synced to another node, but also
+                                                       belong on this node. As with delete.count, this metric
+                                                       is tracked per-device.
+`object-reconstructor.partition.update.timing`         Timing data for partitions reconstructed which also
+                                                       belong on this node. This metric is not tracked
+                                                       per-device.
+`object-reconstructor.suffix.hashes`                   Count of suffix directories whose hash (of filenames)
+                                                       was recalculated.
+`object-reconstructor.suffix.syncs`                    Count of suffix directories reconstructed with ssync.
+====================================================== ======================================================
diff --git a/doc/source/metrics/object_replicator.rst b/doc/source/metrics/object_replicator.rst
new file mode 100644
index 0000000000..d0267d4e3c
--- /dev/null
+++ b/doc/source/metrics/object_replicator.rst
@@ -0,0 +1,25 @@
+``object-replicator`` Metrics
+=============================
+
+=================================================== ====================================================
+Metric Name                                         Description
+--------------------------------------------------- ----------------------------------------------------
+`object-replicator.partition.delete.count.<device>` A count of partitions on <device> which were
+                                                    replicated to another node because they didn't
+                                                    belong on this node. This metric is tracked
+                                                    per-device to allow for "quiescence detection" for
+                                                    object replication activity on each device.
+`object-replicator.partition.delete.timing`         Timing data for partitions replicated to another
+                                                    node because they didn't belong on this node. This
+                                                    metric is not tracked per device.
+`object-replicator.partition.update.count.<device>` A count of partitions on <device> which were
+                                                    replicated to another node, but also belong on this
+                                                    node. As with delete.count, this metric is tracked
+                                                    per-device.
+`object-replicator.partition.update.timing`         Timing data for partitions replicated which also
+                                                    belong on this node. This metric is not tracked
+                                                    per-device.
+`object-replicator.suffix.hashes`                   Count of suffix directories whose hash (of filenames)
+                                                    was recalculated.
+`object-replicator.suffix.syncs`                    Count of suffix directories replicated with rsync.
+=================================================== ====================================================
diff --git a/doc/source/metrics/object_server.rst b/doc/source/metrics/object_server.rst
new file mode 100644
index 0000000000..afc56408d7
--- /dev/null
+++ b/doc/source/metrics/object_server.rst
@@ -0,0 +1,49 @@
+``object-server`` Metrics
+=========================
+
+======================================= ====================================================
+Metric Name                             Description
+--------------------------------------- ----------------------------------------------------
+`object-server.quarantines`             Count of objects (files) found bad and moved to
+                                        quarantine.
+`object-server.async_pendings`          Count of container updates saved as async_pendings
+                                        (may result from PUT or DELETE requests).
+`object-server.POST.errors.timing`      Timing data for POST request errors: bad request,
+                                        missing timestamp, delete-at in past, not mounted.
+`object-server.POST.timing`             Timing data for each POST request not resulting in
+                                        an error.
+`object-server.PUT.errors.timing`       Timing data for PUT request errors: bad request,
+                                        not mounted, missing timestamp, object creation
+                                        constraint violation, delete-at in past.
+`object-server.PUT.timeouts`            Count of object PUTs which exceeded max_upload_time.
+`object-server.PUT.timing`              Timing data for each PUT request not resulting in an
+                                        error.
+`object-server.PUT.<device>.timing`     Timing data per kB transferred (ms/kB) for each
+                                        non-zero-byte PUT request on each device.
+                                        Useful for monitoring problematic devices; higher is bad.
+`object-server.GET.errors.timing`       Timing data for GET request errors: bad request,
+                                        not mounted, header timestamps before the epoch,
+                                        precondition failed.
+                                        File errors resulting in a quarantine are not
+                                        counted here.
+`object-server.GET.timing`              Timing data for each GET request not resulting in an
+                                        error. Includes requests which couldn't find the
+                                        object (including disk errors resulting in file
+                                        quarantine).
+`object-server.HEAD.errors.timing`      Timing data for HEAD request errors: bad request,
+                                        not mounted.
+`object-server.HEAD.timing`             Timing data for each HEAD request not resulting in
+                                        an error. Includes requests which couldn't find the
+                                        object (including disk errors resulting in file
+                                        quarantine).
+`object-server.DELETE.errors.timing`    Timing data for DELETE request errors: bad request,
+                                        missing timestamp, not mounted, precondition
+                                        failed. Includes requests which couldn't find or
+                                        match the object.
+`object-server.DELETE.timing`           Timing data for each DELETE request not resulting
+                                        in an error.
+`object-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
+                                        request, not mounted.
+`object-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
+                                        in an error.
+======================================= ====================================================
diff --git a/doc/source/metrics/object_updater.rst b/doc/source/metrics/object_updater.rst
new file mode 100644
index 0000000000..ca0eb2ad98
--- /dev/null
+++ b/doc/source/metrics/object_updater.rst
@@ -0,0 +1,22 @@
+``object-updater`` Metrics
+==========================
+
+============================ ====================================================
+Metric Name                  Description
+---------------------------- ----------------------------------------------------
+`object-updater.errors`      Count of drives not mounted or async_pending files
+                             with an unexpected name.
+`object-updater.timing`      Timing data for object sweeps to flush async_pending
+                             container updates. Does not include object sweeps
+                             which did not find an existing async_pending storage
+                             directory.
+`object-updater.quarantines` Count of async_pending container updates which were
+                             corrupted and moved to quarantine.
+`object-updater.successes`   Count of successful container updates.
+`object-updater.failures`    Count of failed container updates.
+`object-updater.unlinks`     Count of async_pending files unlinked. An
+                             async_pending file is unlinked either when it is
+                             successfully processed or when the replicator sees
+                             that there is a newer async_pending file for the
+                             same object.
+============================ ====================================================
diff --git a/doc/source/metrics/proxy_server.rst b/doc/source/metrics/proxy_server.rst
new file mode 100644
index 0000000000..56a10773ab
--- /dev/null
+++ b/doc/source/metrics/proxy_server.rst
@@ -0,0 +1,91 @@
+``proxy-server`` Metrics
+========================
+
+In the table, ``<type>`` is the proxy-server controller responsible for the
+request and will be one of ``account``, ``container``, or ``object``.
+
+======================================== ====================================================
+Metric Name                              Description
+---------------------------------------- ----------------------------------------------------
+`proxy-server.errors`                    Count of errors encountered while serving requests
+                                         before the controller type is determined. Includes
+                                         invalid Content-Length, errors finding the internal
+                                         controller to handle the request, invalid utf8, and
+                                         bad URLs.
+`proxy-server.<type>.handoff_count`      Count of node hand-offs; only tracked if log_handoffs
+                                         is set in the proxy-server config.
+`proxy-server.<type>.handoff_all_count`  Count of times *only* hand-off locations were
+                                         utilized; only tracked if log_handoffs is set in the
+                                         proxy-server config.
+`proxy-server.<type>.client_timeouts`    Count of client timeouts (client did not read within
+                                         `client_timeout` seconds during a GET or did not
+                                         supply data within `client_timeout` seconds during
+                                         a PUT).
+`proxy-server.<type>.client_disconnects` Count of detected client disconnects during PUT
+                                         operations (does NOT include caught Exceptions in
+                                         the proxy-server which caused a client disconnect).
+======================================== ====================================================
+
+Additionally, middleware often emit their own metrics.
+
+``proxy-logging`` Middleware
+----------------------------
+
+In the table, ``<type>`` is either the proxy-server controller responsible
+for the request: ``account``, ``container``, ``object``, or the string
+``SOS`` if the request came from the `Swift Origin Server`_ middleware.
+The ``<verb>`` portion will be one of ``GET``, ``HEAD``, ``POST``, ``PUT``,
+``DELETE``, ``COPY``, ``OPTIONS``, or ``BAD_METHOD``. The list of valid
+HTTP methods is configurable via the ``log_statsd_valid_http_methods``
+config variable and the default setting yields the above behavior.
+
+.. _Swift Origin Server: https://github.com/dpgoetz/sos
+
+==================================================== ============================================
+Metric Name                                          Description
+---------------------------------------------------- --------------------------------------------
+`proxy-server.<type>.<verb>.<status>.timing`         Timing data for requests, start to finish.
+                                                     The <status> portion is the numeric HTTP
+                                                     status code for the request (e.g. "200" or
+                                                     "404").
+`proxy-server.<type>.GET.<status>.first-byte.timing` Timing data up to completion of sending the
+                                                     response headers (only for GET requests).
+                                                     <type> and <status> are as for the main
+                                                     timing metric.
+`proxy-server.<type>.<verb>.<status>.xfer`           This counter metric is the sum of bytes
+                                                     transferred in (from clients) and out (to
+                                                     clients) for requests. The <type>, <verb>,
+                                                     and <status> portions of the metric are just
+                                                     like the main timing metric.
+==================================================== ============================================
+
+The ``proxy-logging`` middleware also groups these metrics by policy. The
+``<policy-index>`` portion represents a policy index:
+
+========================================================================== =====================================
+Metric Name                                                                Description
+-------------------------------------------------------------------------- -------------------------------------
+`proxy-server.object.policy.<policy-index>.<verb>.<status>.timing`         Timing data for requests, aggregated
+                                                                           by policy index.
+`proxy-server.object.policy.<policy-index>.GET.<status>.first-byte.timing` Timing data up to completion of
+                                                                           sending the response headers,
+                                                                           aggregated by policy index.
+`proxy-server.object.policy.<policy-index>.<verb>.<status>.xfer`           Sum of bytes transferred in and out,
+                                                                           aggregated by policy index.
+========================================================================== =====================================
+
+``tempauth`` Middleware
+-----------------------
+In the table, ``<reseller_prefix>`` represents the actual configured
+reseller_prefix or ``NONE`` if the reseller_prefix is the empty string:
+
+========================================= ====================================================
+Metric Name                               Description
+----------------------------------------- ----------------------------------------------------
+`tempauth.<reseller_prefix>.unauthorized` Count of regular requests which were denied with
+                                          HTTPUnauthorized.
+`tempauth.<reseller_prefix>.forbidden`    Count of regular requests which were denied with
+                                          HTTPForbidden.
+`tempauth.<reseller_prefix>.token_denied` Count of token requests which were denied.
+`tempauth.<reseller_prefix>.errors`       Count of errors.
+========================================= ====================================================