swift/etc/object-server.conf-sample

487 lines
18 KiB
Plaintext
Raw Normal View History

2010-08-20 00:42:38 +00:00
[DEFAULT]
2010-07-12 17:03:45 -05:00
# bind_ip = 0.0.0.0
bind_port = 6200
# bind_timeout = 30
# backlog = 4096
2010-07-12 17:03:45 -05:00
# user = swift
2010-08-20 00:42:38 +00:00
# swift_dir = /etc/swift
# devices = /srv/node
# mount_check = true
# disable_fallocate = false
# expiring_objects_container_divisor = 86400
# expiring_objects_account_name = expiring_objects
#
# Use an integer to override the number of pre-forked processes that will
Allow 1+ object-servers-per-disk deployment Enabled by a new > 0 integer config value, "servers_per_port" in the [DEFAULT] config section for object-server and/or replication server configs. The setting's integer value determines how many different object-server workers handle requests for any single unique local port in the ring. In this mode, the parent swift-object-server process continues to run as the original user (i.e. root if low-port binding is required), binds to all ports as defined in the ring, and forks off the specified number of workers per listen socket. The child, per-port servers drop privileges and behave pretty much how object-server workers always have, except that because the ring has unique ports per disk, the object-servers will only be handling requests for a single disk. The parent process detects dead servers and restarts them (with the correct listen socket), starts missing servers when an updated ring file is found with a device on the server with a new port, and kills extraneous servers when their port is found to no longer be in the ring. The ring files are stat'ed at most every "ring_check_interval" seconds, as configured in the object-server config (same default of 15s). Immediately stopping all swift-object-worker processes still works by sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process still causes the parent process to close all listen sockets and exit, allowing existing children to finish serving their existing requests. The drop_privileges helper function now has an optional param to suppress the setsid() call, which otherwise screws up the child workers' process management. The class method RingData.load() can be told to only load the ring metadata (i.e. everything except replica2part2dev_id) with the optional kwarg, header_only=True. This is used to keep the parent and all forked off workers from unnecessarily having full copies of all storage policy rings in memory. A new helper class, swift.common.storage_policy.BindPortsCache, provides a method to return a set of all device ports in all rings for the server on which it is instantiated (identified by its set of IP addresses). The BindPortsCache instance will track mtimes of ring files, so they are not opened more frequently than necessary. This patch includes enhancements to the probe tests and object-replicator/object-reconstructor config plumbing to allow the probe tests to work correctly both in the "normal" config (same IP but unique ports for each SAIO "server") and a server-per-port setup where each SAIO "server" must have a unique IP address and unique port per disk within each "server". The main probe tests only work with 4 servers and 4 disks, but you can see the difference in the rings for the EC probe tests where there are 2 disks per server for a total of 8 disks. Specifically, swift.common.ring.utils.is_local_device() will ignore the ports when the "my_port" argument is None. Then, object-replicator and object-reconstructor both set self.bind_port to None if server_per_port is enabled. Bonus improvement for IPv6 addresses in is_local_device(). This PR for vagrant-swift-all-in-one will aid in testing this patch: https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/ Also allow SAIO to answer is_local_device() better; common SAIO setups have multiple "servers" all on the same host with different ports for the different "servers" (which happen to match the IPs specified in the rings for the devices on each of those "servers"). However, you can configure the SAIO to have different localhost IP addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the servers' config files' bind_ip setting. This new whataremyips() implementation combined with a little plumbing allows is_local_device() to accurately answer, even on an SAIO. In the default case (an unspecified bind_ip defaults to '0.0.0.0') as well as an explict "bind to everything" like '0.0.0.0' or '::', whataremyips() behaves as it always has, returning all IP addresses for the server. Also updated probe tests to handle each "server" in the SAIO having a unique IP address. For some (noisy) benchmarks that show servers_per_port=X is at least as good as the same number of "normal" workers: https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md Benchmarks showing the benefits of I/O isolation with a small number of slow disks: https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md If you were wondering what the overhead of threads_per_disk looks like: https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md DocImpact Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
2015-05-14 22:14:15 -07:00
# accept connections. NOTE: if servers_per_port is set, this setting is
# ignored.
# workers = auto
#
# Make object-server run this many worker processes per unique port of "local"
# ring devices across all storage policies. The default value of 0 disables this
# feature.
Allow 1+ object-servers-per-disk deployment Enabled by a new > 0 integer config value, "servers_per_port" in the [DEFAULT] config section for object-server and/or replication server configs. The setting's integer value determines how many different object-server workers handle requests for any single unique local port in the ring. In this mode, the parent swift-object-server process continues to run as the original user (i.e. root if low-port binding is required), binds to all ports as defined in the ring, and forks off the specified number of workers per listen socket. The child, per-port servers drop privileges and behave pretty much how object-server workers always have, except that because the ring has unique ports per disk, the object-servers will only be handling requests for a single disk. The parent process detects dead servers and restarts them (with the correct listen socket), starts missing servers when an updated ring file is found with a device on the server with a new port, and kills extraneous servers when their port is found to no longer be in the ring. The ring files are stat'ed at most every "ring_check_interval" seconds, as configured in the object-server config (same default of 15s). Immediately stopping all swift-object-worker processes still works by sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process still causes the parent process to close all listen sockets and exit, allowing existing children to finish serving their existing requests. The drop_privileges helper function now has an optional param to suppress the setsid() call, which otherwise screws up the child workers' process management. The class method RingData.load() can be told to only load the ring metadata (i.e. everything except replica2part2dev_id) with the optional kwarg, header_only=True. This is used to keep the parent and all forked off workers from unnecessarily having full copies of all storage policy rings in memory. A new helper class, swift.common.storage_policy.BindPortsCache, provides a method to return a set of all device ports in all rings for the server on which it is instantiated (identified by its set of IP addresses). The BindPortsCache instance will track mtimes of ring files, so they are not opened more frequently than necessary. This patch includes enhancements to the probe tests and object-replicator/object-reconstructor config plumbing to allow the probe tests to work correctly both in the "normal" config (same IP but unique ports for each SAIO "server") and a server-per-port setup where each SAIO "server" must have a unique IP address and unique port per disk within each "server". The main probe tests only work with 4 servers and 4 disks, but you can see the difference in the rings for the EC probe tests where there are 2 disks per server for a total of 8 disks. Specifically, swift.common.ring.utils.is_local_device() will ignore the ports when the "my_port" argument is None. Then, object-replicator and object-reconstructor both set self.bind_port to None if server_per_port is enabled. Bonus improvement for IPv6 addresses in is_local_device(). This PR for vagrant-swift-all-in-one will aid in testing this patch: https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/ Also allow SAIO to answer is_local_device() better; common SAIO setups have multiple "servers" all on the same host with different ports for the different "servers" (which happen to match the IPs specified in the rings for the devices on each of those "servers"). However, you can configure the SAIO to have different localhost IP addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the servers' config files' bind_ip setting. This new whataremyips() implementation combined with a little plumbing allows is_local_device() to accurately answer, even on an SAIO. In the default case (an unspecified bind_ip defaults to '0.0.0.0') as well as an explict "bind to everything" like '0.0.0.0' or '::', whataremyips() behaves as it always has, returning all IP addresses for the server. Also updated probe tests to handle each "server" in the SAIO having a unique IP address. For some (noisy) benchmarks that show servers_per_port=X is at least as good as the same number of "normal" workers: https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md Benchmarks showing the benefits of I/O isolation with a small number of slow disks: https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md If you were wondering what the overhead of threads_per_disk looks like: https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md DocImpact Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
2015-05-14 22:14:15 -07:00
# servers_per_port = 0
#
# Maximum concurrent requests per worker
# max_clients = 1024
#
2011-01-23 13:18:28 -08:00
# You can specify default log routing here if you want:
# log_name = swift
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
# The following caps the length of log lines to the value given; no limit if
# set to 0, the default.
# log_max_line_length = 0
#
# comma separated list of functions to call to setup custom log handlers.
# functions get passed: conf, name, log_to_console, log_route, fmt, logger,
# adapted_logger
# log_custom_handlers =
#
Upating proxy-server StatsD logging. Removed many StatsD logging calls in proxy-server and added swift-informant-style catch-all logging in the proxy-logger middleware. Many errors previously rolled into the "proxy-server.<type>.errors" counter will now appear broken down by response code and with timing data at: "proxy-server.<type>.<verb>.<status>.timing". Also, bytes transferred (sum of in + out) will be at: "proxy-server.<type>.<verb>.<status>.xfer". The proxy-logging middleware can get its StatsD config from standard vars in [DEFAULT] or from access_log_statsd_* config vars in its config section. Similarly to Swift Informant, request methods ("verbs") are filtered using the new proxy-logging config var, "log_statsd_valid_http_methods" which defaults to GET, HEAD, POST, PUT, DELETE, and COPY. Requests with methods not in this list use "BAD_METHOD" for <verb> in the metric name. To avoid user error, access_log_statsd_valid_http_methods is also accepted. Previously, proxy-server metrics used "Account", "Container", and "Object" for the <type>, but these are now all lowercase. Updated the admin guide's StatsD docs to reflect the above changes and also include the "proxy-server.<type>.handoff_count" and "proxy-server.<type>.handoff_all_count" metrics. The proxy server now saves off the original req.method and proxy_logging will use this if it can (both for request logging and as the "<verb>" in the statsd timing metric). This fixes bug 1025433. Removed some stale access_log_* related code in proxy/server.py. Also removed the BaseApplication/Application distinction as it's no longer necessary. Fixed up the sample config files a bit (logging lines, mostly). Fixed typo in SAIO development guide. Got proxy_logging.py test coverage to 100%. Fixed proxy_logging.py for PEP8 v1.3.2. Enhanced test.unit.FakeLogger to track more calls to enable testing StatsD metric calls. Change-Id: I45d94cb76450be96d66fcfab56359bdfdc3a2576
2012-08-19 17:44:43 -07:00
# If set, log_udp_host will override log_address
# log_udp_host =
# log_udp_port = 514
#
Upating proxy-server StatsD logging. Removed many StatsD logging calls in proxy-server and added swift-informant-style catch-all logging in the proxy-logger middleware. Many errors previously rolled into the "proxy-server.<type>.errors" counter will now appear broken down by response code and with timing data at: "proxy-server.<type>.<verb>.<status>.timing". Also, bytes transferred (sum of in + out) will be at: "proxy-server.<type>.<verb>.<status>.xfer". The proxy-logging middleware can get its StatsD config from standard vars in [DEFAULT] or from access_log_statsd_* config vars in its config section. Similarly to Swift Informant, request methods ("verbs") are filtered using the new proxy-logging config var, "log_statsd_valid_http_methods" which defaults to GET, HEAD, POST, PUT, DELETE, and COPY. Requests with methods not in this list use "BAD_METHOD" for <verb> in the metric name. To avoid user error, access_log_statsd_valid_http_methods is also accepted. Previously, proxy-server metrics used "Account", "Container", and "Object" for the <type>, but these are now all lowercase. Updated the admin guide's StatsD docs to reflect the above changes and also include the "proxy-server.<type>.handoff_count" and "proxy-server.<type>.handoff_all_count" metrics. The proxy server now saves off the original req.method and proxy_logging will use this if it can (both for request logging and as the "<verb>" in the statsd timing metric). This fixes bug 1025433. Removed some stale access_log_* related code in proxy/server.py. Also removed the BaseApplication/Application distinction as it's no longer necessary. Fixed up the sample config files a bit (logging lines, mostly). Fixed typo in SAIO development guide. Got proxy_logging.py test coverage to 100%. Fixed proxy_logging.py for PEP8 v1.3.2. Enhanced test.unit.FakeLogger to track more calls to enable testing StatsD metric calls. Change-Id: I45d94cb76450be96d66fcfab56359bdfdc3a2576
2012-08-19 17:44:43 -07:00
# You can enable StatsD logging here:
# log_statsd_host =
Adding StatsD logging to Swift. Documentation, including a list of metrics reported and their semantics, is in the Admin Guide in a new section, "Reporting Metrics to StatsD". An optional "metric prefix" may be configured which will be prepended to every metric name sent to StatsD. Here is the rationale for doing a deep integration like this versus only sending metrics to StatsD in middleware. It's the only way to report some internal activities of Swift in a real-time manner. So to have one way of reporting to StatsD and one place/style of configuration, even some things (like, say, timing of PUT requests into the proxy-server) which could be logged via middleware are consistently logged the same way (deep integration via the logger delegate methods). When log_statsd_host is configured, get_logger() injects a swift.common.utils.StatsdClient object into the logger as logger.statsd_client. Then a set of delegate methods on LogAdapter either pass through to the StatsdClient object or become no-ops. This allows StatsD logging to look like: self.logger.increment('some.metric.here') and do the right thing in all cases and with no messy conditional logic. I wanted to use the pystatsd module for the StatsD client, but the version on PyPi is lagging the git repo (and is missing both the prefix functionality and timing_since() method). So I wrote my swift.common.utils.StatsdClient. The interface is the same as pystatsd.Client, but the code was written from scratch. It's pretty simple, and the tests I added cover it. This also frees Swift from an optional dependency on the pystatsd module, making this feature easier to enable. There's test coverage for the new code and all existing tests continue to pass. Refactored out _one_audit_pass() method in swift/account/auditor.py and swift/container/auditor.py. Fixed some misc. PEP8 violations. Misc test cleanups and refactorings (particularly the way "fake logging" is handled). Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
# log_statsd_port = 8125
# log_statsd_default_sample_rate = 1.0
# log_statsd_sample_rate_factor = 1.0
Adding StatsD logging to Swift. Documentation, including a list of metrics reported and their semantics, is in the Admin Guide in a new section, "Reporting Metrics to StatsD". An optional "metric prefix" may be configured which will be prepended to every metric name sent to StatsD. Here is the rationale for doing a deep integration like this versus only sending metrics to StatsD in middleware. It's the only way to report some internal activities of Swift in a real-time manner. So to have one way of reporting to StatsD and one place/style of configuration, even some things (like, say, timing of PUT requests into the proxy-server) which could be logged via middleware are consistently logged the same way (deep integration via the logger delegate methods). When log_statsd_host is configured, get_logger() injects a swift.common.utils.StatsdClient object into the logger as logger.statsd_client. Then a set of delegate methods on LogAdapter either pass through to the StatsdClient object or become no-ops. This allows StatsD logging to look like: self.logger.increment('some.metric.here') and do the right thing in all cases and with no messy conditional logic. I wanted to use the pystatsd module for the StatsD client, but the version on PyPi is lagging the git repo (and is missing both the prefix functionality and timing_since() method). So I wrote my swift.common.utils.StatsdClient. The interface is the same as pystatsd.Client, but the code was written from scratch. It's pretty simple, and the tests I added cover it. This also frees Swift from an optional dependency on the pystatsd module, making this feature easier to enable. There's test coverage for the new code and all existing tests continue to pass. Refactored out _one_audit_pass() method in swift/account/auditor.py and swift/container/auditor.py. Fixed some misc. PEP8 violations. Misc test cleanups and refactorings (particularly the way "fake logging" is handled). Change-Id: Ie968a9ae8771f59ee7591e2ae11999c44bfe33b2
2012-04-01 16:47:08 -07:00
# log_statsd_metric_prefix =
#
# eventlet_debug = false
#
# You can set fallocate_reserve to the number of bytes or percentage of disk
# space you'd like fallocate to reserve, whether there is space for the given
# file size or not. Percentage will be used if the value ends with a '%'.
# fallocate_reserve = 1%
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
#
# Time to wait while attempting to connect to another backend node.
# conn_timeout = 0.5
# Time to wait while sending each chunk of data to another backend node.
# node_timeout = 3
# Time to wait while sending a container update on object update.
# container_update_timeout = 1.0
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# Time to wait while receiving each chunk of data from a client or another
# backend node.
# client_timeout = 60
#
# network_chunk_size = 65536
# disk_chunk_size = 65536
#
Move documented reclaim_age option to correct location The reclaim_age is a DiskFile option, it doesn't make sense for two different object services or nodes to use different values. I also driveby cleanup the reclaim_age plumbing from get_hashes to cleanup_ondisk_files since it's a method on the Manager and has access to the configured reclaim_age. This fixes a bug where finalize_put wouldn't use the [DEFAULT]/object-server configured reclaim_age - which is normally benign but leads to weird behavior on DELETE requests with really small reclaim_age. There's a couple of places in the replicator and reconstructor that reach into their manager to borrow the reclaim_age when emptying out the aborted PUTs that failed to cleanup their files in tmp - but that timeout doesn't really need to be coupled with reclaim_age and that method could have just as reasonably been implemented on the Manager. UpgradeImpact: Previously the reclaim_age was documented to be configurable in various object-* services config sections, but that did not work correctly unless you also configured the option for the object-server because of REPLICATE request rehash cleanup. All object services must use the same reclaim_age. If you require a non-default reclaim age it should be set in the [DEFAULT] section. If there are different non-default values, the greater should be used for all object services and configured only in the [DEFAULT] section. If you specify a reclaim_age value in any object related config you should move it to *only* the [DEFAULT] section before you upgrade. If you configure a reclaim_age less that your consistency window you are likely to be eaten by a Grue. Closes-Bug: #1626296 Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916 Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
2016-07-25 20:10:44 +05:30
# Reclamation of tombstone files is performed primarily by the replicator and
# the reconstructor but the object-server and object-auditor also reference
# this value - it should be the same for all object services in the cluster,
# and not greater than the container services reclaim_age
# reclaim_age = 604800
#
# You can set scheduling priority of processes. Niceness values range from -20
# (most favorable to the process) to 19 (least favorable to the process).
# nice_priority =
#
# You can set I/O scheduling class and priority of processes. I/O niceness
# class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort) and
# IOPRIO_CLASS_IDLE (idle). I/O niceness priority is a number which goes from
# 0 to 7. The higher the value, the lower the I/O priority of the process.
# Work only with ionice_class.
# ionice_class =
# ionice_priority =
2010-08-20 00:42:38 +00:00
[pipeline:main]
pipeline = healthcheck recon object-server
2010-08-20 00:42:38 +00:00
[app:object-server]
use = egg:swift#object
2011-01-23 13:18:28 -08:00
# You can override the default log routing for this app here:
# set log_name = object-server
# set log_facility = LOG_LOCAL0
# set log_level = INFO
# set log_requests = true
# set log_address = /dev/log
#
2010-07-12 17:03:45 -05:00
# max_upload_time = 86400
#
# slow is the total amount of seconds an object PUT/DELETE request takes at
# least. If it is faster, the object server will sleep this amount of time minus
# the already passed transaction time. This is only useful for simulating slow
# devices on storage nodes during testing and development.
# slow = 0
#
# Objects smaller than this are not evicted from the buffercache once read
# keep_cache_size = 5242880
#
# If true, objects for authenticated GET requests may be kept in buffer cache
# if small enough
# keep_cache_private = false
#
2010-10-13 21:29:58 +00:00
# on PUTs, sync data every n MB
# mb_per_sync = 512
#
# Comma separated list of headers that can be set in metadata on an object.
# This list is in addition to X-Object-Meta-* headers and cannot include
# Content-Type, etag, Content-Length, or deleted
# allowed_headers = Content-Disposition, Content-Encoding, X-Delete-At, X-Object-Manifest, X-Static-Large-Object, Cache-Control, Content-Language, Expires, X-Robots-Tag
#
# auto_create_account_prefix = .
#
# The number of threads in eventlet's thread pool. Most IO will occur
# in the object server's main thread, but certain "heavy" IO
# operations will occur in separate IO threads, managed by eventlet.
#
# The default value is auto, whose actual value is dependent on the
# servers_per_port value:
#
# - When servers_per_port is zero, the default value of
# eventlet_tpool_num_threads is empty, which uses eventlet's default
# (currently 20 threads).
#
# - When servers_per_port is nonzero, the default value of
# eventlet_tpool_num_threads is 1.
#
# But you may override this value to any integer value.
#
# Note that this value is threads per object-server process, so to
# compute the total number of IO threads on a node, you must multiply
# this by the number of object-server processes on the node.
#
# eventlet_tpool_num_threads = auto
# Configure parameter for creating specific server
# To handle all verbs, including replication verbs, do not specify
# "replication_server" (this is the default). To only handle replication,
# set to a True value (e.g. "True" or "1"). To handle only non-replication
# verbs, set to "False". Unless you have a separate replication network, you
# should not specify any value for "replication_server".
# replication_server = false
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
#
# Set to restrict the number of concurrent incoming SSYNC requests
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# Set to 0 for unlimited
# Note that SSYNC requests are only used by the object reconstructor or the
# object replicator when configured to use ssync.
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# replication_concurrency = 4
#
# Set to restrict the number of concurrent incoming SSYNC requests per
# device; set to 0 for unlimited requests per device. This can help control
# I/O to each device. This does not override replication_concurrency described
# above, so you may need to adjust both parameters depending on your hardware
# or network capacity.
# replication_concurrency_per_device = 1
#
# Number of seconds to wait for an existing replication device lock before
# giving up.
# replication_lock_timeout = 15
#
# These next two settings control when the SSYNC subrequest handler will
# abort an incoming SSYNC attempt. An abort will occur if there are at
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# least threshold number of failures and the value of failures / successes
# exceeds the ratio. The defaults of 100 and 1.0 means that at least 100
# failures have to occur and there have to be more failures than successes for
# an abort to occur.
# replication_failure_threshold = 100
# replication_failure_ratio = 1.0
Zero-copy object-server GET responses with splice() This commit lets the object server use splice() and tee() to move data from disk to the network without ever copying it into user space. Requires Linux. Sorry, FreeBSD folks. You still have the old mechanism, as does anyone who doesn't want to use splice. This requires a relatively recent kernel (2.6.38+) to work, which includes the two most recent Ubuntu LTS releases (Precise and Trusty) as well as RHEL 7. However, it excludes Lucid and RHEL 6. On those systems, setting "splice = on" will result in warnings in the logs but no actual use of splice. Note that this only applies to GET responses without Range headers. It can easily be extended to single-range GET requests, but this commit leaves that for future work. Same goes for PUT requests, or at least non-chunked ones. On some real hardware I had laying around (not a VM), this produced a 37% reduction in CPU usage for GETs made directly to the object server. Measurements were done by looking at /proc/<pid>/stat, specifically the utime and stime fields (user and kernel CPU jiffies, respectively). Note: There is a Python module called "splicetee" available on PyPi, but it's licensed under the GPL, so it cannot easily be added to OpenStack's requirements. That's why this patch uses ctypes instead. Also fixed a long-standing annoyance in FakeLogger: >>> fake_logger.warn('stuff') >>> fake_logger.get_lines_for_level('warn') [] >>> This, of course, is because the correct log level is 'warning'. Now you get a KeyError if you call get_lines_for_level with a bogus log level. Change-Id: Ic6d6b833a5b04ca2019be94b1b90d941929d21c8
2014-06-10 14:15:27 -07:00
#
# Use splice() for zero-copy object GETs. This requires Linux kernel
# version 3.0 or greater. If you set "splice = yes" but the kernel
# does not support it, error messages will appear in the object server
# logs at startup, but your object servers should continue to function.
#
# splice = no
#
# You can set scheduling priority of processes. Niceness values range from -20
# (most favorable to the process) to 19 (least favorable to the process).
# nice_priority =
#
# You can set I/O scheduling class and priority of processes. I/O niceness
# class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort) and
# IOPRIO_CLASS_IDLE (idle). I/O niceness priority is a number which goes from
# 0 to 7. The higher the value, the lower the I/O priority of the process.
# Work only with ionice_class.
# ionice_class =
# ionice_priority =
2010-07-12 17:03:45 -05:00
[filter:healthcheck]
use = egg:swift#healthcheck
# An optional filesystem path, which if present, will cause the healthcheck
# URL to return "503 Service Unavailable" with a body of "DISABLED BY FILE"
# disable_path =
[filter:recon]
use = egg:swift#recon
#recon_cache_path = /var/cache/swift
#recon_lock_path = /var/lock
2010-07-12 17:03:45 -05:00
[object-replicator]
2011-01-23 13:18:28 -08:00
# You can override the default log routing for this app here (don't use set!):
# log_name = object-replicator
2011-01-23 13:18:28 -08:00
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
2010-07-12 17:03:45 -05:00
# daemonize = on
#
# Time in seconds to wait between replication passes
# interval = 30
# run_pause is deprecated, use interval instead
2010-07-12 17:03:45 -05:00
# run_pause = 30
#
# Number of concurrent replication jobs to run. This is per-process,
# so replicator_workers=W and concurrency=C will result in W*C
# replication jobs running at once.
2010-07-12 17:03:45 -05:00
# concurrency = 1
#
# Number of worker processes to use. No matter how big this number is,
# at most one worker per disk will be used. 0 means no forking; all work
# is done in the main process.
# replicator_workers = 0
#
# stats_interval = 300
#
# default is rsync, alternative is ssync
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# sync_method = rsync
#
# max duration of a partition rsync
# rsync_timeout = 900
#
# bandwidth limit for rsync in kB/s. 0 means unlimited
# rsync_bwlimit = 0
#
# passed to rsync for io op timeout
# rsync_io_timeout = 30
#
# Allow rsync to compress data which is transmitted to destination node
# during sync. However, this is applicable only when destination node is in
# a different region than the local one.
# NOTE: Objects that are already compressed (for example: .tar.gz, .mp3) might
# slow down the syncing process.
# rsync_compress = no
#
# Format of the rsync module where the replicator will send data. See
Allows to configure the rsync modules where the replicators will send data Currently, the rsync module where the replicators send data is static. It forbids administrators to set rsync configuration based on their current deployment or needs. As an example, the rsyncd configuration example encourages to set a connections limit for the modules account, container and object. It permits to protect devices from excessives parallels connections, because it would impact performances. On a server with many devices, it is tempting to increase this number proportionally, but nothing guarantees that the distribution of the connections will be balanced. In the worst scenario, a single device can receive all the connections, which is a severe impact on performances. This commit adds a new option named 'rsync_module' to the *-replicator sections of the *-server configuration file. This configuration variable can be extrapolated with device attributes like ip, port, device, zone, ... by using the format {NAME}. eg: rsync_module = {replication_ip}::object_{device} With this configuration, an administrators can solve the problem of connections distribution by creating one module per device in rsyncd configuration. The default values are backward compatible: {replication_ip}::account {replication_ip}::container {replication_ip}::object Option vm_test_mode is deprecated by this commit, but backward compatibility is maintained. The option is only effective when rsync_module is not set. In that case, {replication_port} is appended to the default value of rsync_module. Change-Id: Iad91df50dadbe96c921181797799b4444323ce2e
2015-06-16 12:47:26 +02:00
# etc/rsyncd.conf-sample for some usage examples.
# rsync_module = {replication_ip}::object
#
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# node_timeout = <whatever's in the DEFAULT section or 10>
# max duration of an http request; this is for REPLICATE finalization calls and
# so should be longer than node_timeout
# http_timeout = 60
#
# attempts to kill all workers if nothing replicates for lockup_timeout seconds
# lockup_timeout = 1800
#
# ring_check_interval = 15
# recon_cache_path = /var/cache/swift
#
# limits how long rsync error log lines are
# 0 means to log the entire line
# rsync_error_log_line_length = 0
#
# handoffs_first and handoff_delete are options for a special case
# such as disk full in the cluster. These two options SHOULD NOT BE
# CHANGED, except for such an extreme situations. (e.g. disks filled up
# or are about to fill up. Anyway, DO NOT let your drives fill up)
# handoffs_first is the flag to replicate handoffs prior to canonical
# partitions. It allows to force syncing and deleting handoffs quickly.
# If set to a True value(e.g. "True" or "1"), partitions
# that are not supposed to be on the node will be replicated first.
# handoffs_first = False
#
# handoff_delete is the number of replicas which are ensured in swift.
# If the number less than the number of replicas is set, object-replicator
# could delete local handoffs even if all replicas are not ensured in the
# cluster. Object-replicator would remove local handoff partition directories
# after syncing partition when the number of successful responses is greater
# than or equal to this number. By default(auto), handoff partitions will be
# removed when it has successfully replicated to all the canonical nodes.
# handoff_delete = auto
#
# You can set scheduling priority of processes. Niceness values range from -20
# (most favorable to the process) to 19 (least favorable to the process).
# nice_priority =
#
# You can set I/O scheduling class and priority of processes. I/O niceness
# class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort) and
# IOPRIO_CLASS_IDLE (idle). I/O niceness priority is a number which goes from
# 0 to 7. The higher the value, the lower the I/O priority of the process.
# Work only with ionice_class.
# ionice_class =
# ionice_priority =
2010-07-12 17:03:45 -05:00
[object-reconstructor]
# You can override the default log routing for this app here (don't use set!):
# Unless otherwise noted, each setting below has the same meaning as described
# in the [object-replicator] section, however these settings apply to the EC
# reconstructor
#
# log_name = object-reconstructor
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
# daemonize = on
#
# Time in seconds to wait between reconstruction passes
# interval = 30
# run_pause is deprecated, use interval instead
# run_pause = 30
#
# Maximum number of worker processes to spawn. Each worker will handle a
# subset of devices. Devices will be assigned evenly among the workers so that
# workers cycle at similar intervals (which can lead to fewer workers than
# requested). You can not have more workers than devices. If you have no
# devices only a single worker is spawned.
# reconstructor_workers = 0
#
# concurrency = 1
# stats_interval = 300
# node_timeout = 10
# http_timeout = 60
# lockup_timeout = 1800
# ring_check_interval = 15
# recon_cache_path = /var/cache/swift
Deprecate broken handoffs_first in favor of handoffs_only The handoffs_first mode in the replicator has the useful behavior of processing all handoff parts across all disks until there aren't any handoffs anymore on the node [1] and then it seemingly tries to drop back into normal operation. In practice I've only ever heard of handoffs_first used while rebalancing and turned off as soon as the rebalance finishes - it's not recommended to run with handoffs_first mode turned on and it emits a warning on startup if option is enabled. The handoffs_first mode on the reconstructor doesn't work - it was prioritizing handoffs *per-part* [2] - which is really unfortunate because in the reconstructor during a rebalance it's often *much* more attractive from an efficiency disk/network perspective to revert a partition from a handoff than it is to rebuild an entire partition from another primary using the other EC fragments in the cluster. This change deprecates handoffs_first in favor of handoffs_only in the reconstructor which is far more useful - and just like handoffs_first mode in the replicator - it gives the operator the option of forcing the consistency engine to focus on rebalance. The handoffs_only behavior is somewhat consistent with the replicator's handoffs_first option (any error on any handoff in the replicactor will make it essentially handoff only forever) but the option does what you want and is named correctly in the reconstructor. For consistency with the replicator the reconstructor will mostly honor the handoffs_first option, but if you set handoffs_only in the config it always takes precedence. Having handoffs_first in your config always results in a warning, but if handoff_only is not set and handoffs_first is true the reconstructor will assume you need handoffs_only and behaves as such. When running in handoffs_only mode the reconstructor will start to log a warning every cycle if you leave it running in handoffs_only after it finishes reverting handoffs. However you should be monitoring on-disk partitions and disable the option as soon as the cluster finishes the full rebalance cycle. 1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator handoffs_first "mode" 2. Unlike replication each partition in a EC policy can have a different kind of job per frag_index, but the cardinality of jobs is typically only one (either sync or revert) unless there's been a bunch of errors during write and then handoffs partitions maybe hold a number of different fragments. Known-Issues: handoffs_only is not documented outside of the example config, see lp bug #1626290 Closes-Bug: #1653018 Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
2017-01-25 11:51:03 -08:00
# The handoffs_only mode option is for special case emergency situations during
# rebalance such as disk full in the cluster. This option SHOULD NOT BE
# CHANGED, except for extreme situations. When handoffs_only mode is enabled
# the reconstructor will *only* revert fragments from handoff nodes to primary
# nodes and will not sync primary nodes with neighboring primary nodes. This
# will force the reconstructor to sync and delete handoffs' fragments more
# quickly and minimize the time of the rebalance by limiting the number of
# rebuilds. The handoffs_only option is only for temporary use and should be
# disabled as soon as the emergency situation has been resolved. When
# handoffs_only is not set, the deprecated handoffs_first option will be
# honored as a synonym, but may be ignored in a future release.
Deprecate broken handoffs_first in favor of handoffs_only The handoffs_first mode in the replicator has the useful behavior of processing all handoff parts across all disks until there aren't any handoffs anymore on the node [1] and then it seemingly tries to drop back into normal operation. In practice I've only ever heard of handoffs_first used while rebalancing and turned off as soon as the rebalance finishes - it's not recommended to run with handoffs_first mode turned on and it emits a warning on startup if option is enabled. The handoffs_first mode on the reconstructor doesn't work - it was prioritizing handoffs *per-part* [2] - which is really unfortunate because in the reconstructor during a rebalance it's often *much* more attractive from an efficiency disk/network perspective to revert a partition from a handoff than it is to rebuild an entire partition from another primary using the other EC fragments in the cluster. This change deprecates handoffs_first in favor of handoffs_only in the reconstructor which is far more useful - and just like handoffs_first mode in the replicator - it gives the operator the option of forcing the consistency engine to focus on rebalance. The handoffs_only behavior is somewhat consistent with the replicator's handoffs_first option (any error on any handoff in the replicactor will make it essentially handoff only forever) but the option does what you want and is named correctly in the reconstructor. For consistency with the replicator the reconstructor will mostly honor the handoffs_first option, but if you set handoffs_only in the config it always takes precedence. Having handoffs_first in your config always results in a warning, but if handoff_only is not set and handoffs_first is true the reconstructor will assume you need handoffs_only and behaves as such. When running in handoffs_only mode the reconstructor will start to log a warning every cycle if you leave it running in handoffs_only after it finishes reverting handoffs. However you should be monitoring on-disk partitions and disable the option as soon as the cluster finishes the full rebalance cycle. 1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator handoffs_first "mode" 2. Unlike replication each partition in a EC policy can have a different kind of job per frag_index, but the cardinality of jobs is typically only one (either sync or revert) unless there's been a bunch of errors during write and then handoffs partitions maybe hold a number of different fragments. Known-Issues: handoffs_only is not documented outside of the example config, see lp bug #1626290 Closes-Bug: #1653018 Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
2017-01-25 11:51:03 -08:00
# handoffs_only = False
#
# You can set scheduling priority of processes. Niceness values range from -20
# (most favorable to the process) to 19 (least favorable to the process).
# nice_priority =
#
# You can set I/O scheduling class and priority of processes. I/O niceness
# class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort) and
# IOPRIO_CLASS_IDLE (idle). I/O niceness priority is a number which goes from
# 0 to 7. The higher the value, the lower the I/O priority of the process.
# Work only with ionice_class.
# ionice_class =
# ionice_priority =
2010-07-12 17:03:45 -05:00
[object-updater]
2011-01-23 13:18:28 -08:00
# You can override the default log routing for this app here (don't use set!):
# log_name = object-updater
2011-01-23 13:18:28 -08:00
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
2010-07-12 17:03:45 -05:00
# interval = 300
Object replication ssync (an rsync alternative) For this commit, ssync is just a direct replacement for how we use rsync. Assuming we switch over to ssync completely someday and drop rsync, we will then be able to improve the algorithms even further (removing local objects as we successfully transfer each one rather than waiting for whole partitions, using an index.db with hash-trees, etc., etc.) For easier review, this commit can be thought of in distinct parts: 1) New global_conf_callback functionality for allowing services to perform setup code before workers, etc. are launched. (This is then used by ssync in the object server to create a cross-worker semaphore to restrict concurrent incoming replication.) 2) A bit of shifting of items up from object server and replicator to diskfile or DEFAULT conf sections for better sharing of the same settings. conn_timeout, node_timeout, client_timeout, network_chunk_size, disk_chunk_size. 3) Modifications to the object server and replicator to optionally use ssync in place of rsync. This is done in a generic enough way that switching to FutureSync should be easy someday. 4) The biggest part, and (at least for now) completely optional part, are the new ssync_sender and ssync_receiver files. Nice and isolated for easier testing and visibility into test coverage, etc. All the usual logging, statsd, recon, etc. instrumentation is still there when using ssync, just as it is when using rsync. Beyond the essential error and exceptional condition logging, I have not added any additional instrumentation at this time. Unless there is something someone finds super pressing to have added to the logging, I think such additions would be better as separate change reviews. FOR NOW, IT IS NOT RECOMMENDED TO USE SSYNC ON PRODUCTION CLUSTERS. Some of us will be in a limited fashion to look for any subtle issues, tuning, etc. but generally ssync is an experimental feature. In its current implementation it is probably going to be a bit slower than rsync, but if all goes according to plan it will end up much faster. There are no comparisions yet between ssync and rsync other than some raw virtual machine testing I've done to show it should compete well enough once we can put it in use in the real world. If you Tweet, Google+, or whatever, be sure to indicate it's experimental. It'd be best to keep it out of deployment guides, howtos, etc. until we all figure out if we like it, find it to be stable, etc. Change-Id: If003dcc6f4109e2d2a42f4873a0779110fff16d6
2013-08-28 16:10:43 +00:00
# node_timeout = <whatever's in the DEFAULT section or 10>
#
object-updater: add concurrent updates The object updater now supports two configuration settings: "concurrency" and "updater_workers". The latter controls how many worker processes are spawned, while the former controls how many concurrent container updates are performed by each worker process. This should speed the processing of async_pendings. There is a change to the semantics of the configuration options. Previously, "concurrency" controlled the number of worker processes spawned, and "updater_workers" did not exist. I switched the meanings for consistency with other configuration options. In the object reconstructor, object replicator, object server, object expirer, container replicator, container server, account replicator, account server, and account reaper, "concurrency" refers to the number of concurrent tasks performed within one process (for reference, the container updater and object auditor use "concurrency" to mean number of processes). On upgrade, a node configured with concurrency=N will still handle async updates N-at-a-time, but will do so using only one process instead of N. UpgradeImpact: If you have a config file like this: [object-updater] concurrency = <N> and you want to take advantage of faster updates, then do this: [object-updater] concurrency = 8 # the default; you can omit this line updater_workers = <N> If you want updates to be processed exactly as before, do this: [object-updater] concurrency = 1 updater_workers = <N> Change-Id: I17e18088e61f664e1b9942d66423666d0cae1689
2018-06-04 16:26:50 -07:00
# updater_workers controls how many processes the object updater will
# spawn, while concurrency controls how many async_pending records
# each updater process will operate on at any one time. With
# concurrency=C and updater_workers=W, there will be up to W*C
# async_pending records being processed at once.
# concurrency = 8
# updater_workers = 1
#
# Send at most this many object updates per second
# objects_per_second = 50
#
# slowdown will sleep that amount between objects. Deprecated; use
# objects_per_second instead.
2010-07-12 17:03:45 -05:00
# slowdown = 0.01
#
# Log stats (at INFO level) every report_interval seconds. This
# logging is per-process, so with concurrency > 1, the logs will
# contain one stats log per worker process every report_interval
# seconds.
# report_interval = 300
#
# recon_cache_path = /var/cache/swift
#
# You can set scheduling priority of processes. Niceness values range from -20
# (most favorable to the process) to 19 (least favorable to the process).
# nice_priority =
#
# You can set I/O scheduling class and priority of processes. I/O niceness
# class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort) and
# IOPRIO_CLASS_IDLE (idle). I/O niceness priority is a number which goes from
# 0 to 7. The higher the value, the lower the I/O priority of the process.
# Work only with ionice_class.
# ionice_class =
# ionice_priority =
2010-07-12 17:03:45 -05:00
[object-auditor]
2011-01-23 13:18:28 -08:00
# You can override the default log routing for this app here (don't use set!):
# log_name = object-auditor
2011-01-23 13:18:28 -08:00
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
# Time in seconds to wait between auditor passes
# interval = 30
#
# You can set the disk chunk size that the auditor uses making it larger if
# you like for more efficient local auditing of larger objects
# disk_chunk_size = 65536
2010-12-28 14:54:00 -08:00
# files_per_second = 20
# concurrency = 1
2010-12-28 14:54:00 -08:00
# bytes_per_second = 10000000
# log_time = 3600
2011-02-21 16:37:12 -08:00
# zero_byte_files_per_second = 50
# recon_cache_path = /var/cache/swift
# Takes a comma separated list of ints. If set, the object auditor will
# increment a counter for every object whose size is <= to the given break
# points and report the result after a full scan.
# object_size_stats =
#
# You can set scheduling priority of processes. Niceness values range from -20
# (most favorable to the process) to 19 (least favorable to the process).
# nice_priority =
#
# You can set I/O scheduling class and priority of processes. I/O niceness
# class values are IOPRIO_CLASS_RT (realtime), IOPRIO_CLASS_BE (best-effort) and
# IOPRIO_CLASS_IDLE (idle). I/O niceness priority is a number which goes from
# 0 to 7. The higher the value, the lower the I/O priority of the process.
# Work only with ionice_class.
# ionice_class =
# ionice_priority =
# The auditor will cleanup old rsync tempfiles after they are "old
# enough" to delete. You can configure the time elapsed in seconds
# before rsync tempfiles will be unlinked, or the default value of
# "auto" try to use object-replicator's rsync_timeout + 900 and fallback
# to 86400 (1 day).
# rsync_tempfile_timeout = auto
# Note: Put it at the beginning of the pipleline to profile all middleware. But
# it is safer to put this after healthcheck.
[filter:xprofile]
use = egg:swift#xprofile
# This option enable you to switch profilers which should inherit from python
# standard profiler. Currently the supported value can be 'cProfile',
# 'eventlet.green.profile' etc.
# profile_module = eventlet.green.profile
#
# This prefix will be used to combine process ID and timestamp to name the
# profile data file. Make sure the executing user has permission to write
# into this path (missing path segments will be created, if necessary).
# If you enable profiling in more than one type of daemon, you must override
# it with an unique value like: /var/log/swift/profile/object.profile
# log_filename_prefix = /tmp/log/swift/profile/default.profile
#
# the profile data will be dumped to local disk based on above naming rule
# in this interval.
# dump_interval = 5.0
#
# Be careful, this option will enable profiler to dump data into the file with
# time stamp which means there will be lots of files piled up in the directory.
# dump_timestamp = false
#
# This is the path of the URL to access the mini web UI.
# path = /__profile__
#
# Clear the data when the wsgi server shutdown.
# flush_at_shutdown = false
#
# unwind the iterator of applications
# unwind = false