df134df901
Enabled by a new > 0 integer config value, "servers_per_port" in the [DEFAULT] config section for object-server and/or replication server configs. The setting's integer value determines how many different object-server workers handle requests for any single unique local port in the ring. In this mode, the parent swift-object-server process continues to run as the original user (i.e. root, if low-port binding is required), binds to all ports as defined in the ring, and forks off the specified number of workers per listen socket. The child per-port servers drop privileges and behave pretty much how object-server workers always have, except that because the ring has unique ports per disk, each object-server will only be handling requests for a single disk.

The parent process detects dead servers and restarts them (with the correct listen socket), starts missing servers when an updated ring file is found with a device on the server with a new port, and kills extraneous servers when their port is found to no longer be in the ring. The ring files are stat'ed at most every "ring_check_interval" seconds, as configured in the object-server config (same default of 15s).

Immediately stopping all swift-object-worker processes still works by sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process still causes it to close all listen sockets and exit, allowing existing children to finish serving their existing requests.

The drop_privileges helper function now has an optional param to suppress the setsid() call, which otherwise screws up the child workers' process management.

The class method RingData.load() can be told to load only the ring metadata (i.e. everything except replica2part2dev_id) with the optional kwarg header_only=True. This is used to keep the parent and all forked-off workers from unnecessarily holding full copies of all storage policy rings in memory.

A new helper class, swift.common.storage_policy.BindPortsCache, provides a method to return a set of all device ports in all rings for the server on which it is instantiated (identified by its set of IP addresses). The BindPortsCache instance tracks the mtimes of ring files, so they are not opened more frequently than necessary.
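For illustration, a minimal usage sketch of the two helpers just described. The header_only kwarg is as stated above; the BindPortsCache constructor arguments and method name are assumptions based on the description, not verbatim from the patch:

    from swift.common.ring import RingData
    from swift.common.storage_policy import BindPortsCache

    # Load only the ring metadata (everything except replica2part2dev_id),
    # so a long-lived parent process need not hold full rings in memory:
    ring_meta = RingData.load('/etc/swift/object.ring.gz', header_only=True)

    # Collect every device port this node must bind, across all storage
    # policy rings; the cache tracks ring file mtimes so rings are only
    # re-read when they change. (Constructor/method names assumed here.)
    cache = BindPortsCache('/etc/swift', '127.0.0.1')
    ports = cache.all_bind_ports_for_node()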
This patch includes enhancements to the probe tests and to the object-replicator/object-reconstructor config plumbing to allow the probe tests to work correctly both in the "normal" config (same IP but unique ports for each SAIO "server") and in a server-per-port setup where each SAIO "server" must have a unique IP address and a unique port per disk within each "server". The main probe tests only work with 4 servers and 4 disks, but you can see the difference in the rings for the EC probe tests, where there are 2 disks per server for a total of 8 disks. Specifically, swift.common.ring.utils.is_local_device() will ignore the ports when the "my_port" argument is None. Then, object-replicator and object-reconstructor both set self.bind_port to None if server_per_port is enabled. Bonus improvement for IPv6 addresses in is_local_device().

This PR for vagrant-swift-all-in-one will aid in testing this patch: https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/

Also allow the SAIO to answer is_local_device() better; common SAIO setups have multiple "servers" all on the same host with different ports for the different "servers" (which happen to match the IPs specified in the rings for the devices on each of those "servers"). However, you can configure the SAIO to have different localhost IP addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the servers' config files' bind_ip setting. This new whataremyips() implementation, combined with a little plumbing, allows is_local_device() to answer accurately, even on an SAIO. In the default case (an unspecified bind_ip defaults to '0.0.0.0'), as well as with an explicit "bind to everything" address like '0.0.0.0' or '::', whataremyips() behaves as it always has, returning all IP addresses for the server.

Also updated the probe tests to handle each "server" in the SAIO having a unique IP address.
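A sketch of the behavior just described (a minimal illustration; the exact return values shown are assumptions drawn from the description above, not output captured from the patch):

    from swift.common.ring.utils import is_local_device
    from swift.common.utils import whataremyips

    # An unspecified bind_ip (default '0.0.0.0'), or an explicit
    # "bind to everything" address, returns all the server's IPs as before:
    whataremyips()           # all IP addresses
    whataremyips('0.0.0.0')  # all IP addresses
    whataremyips('::')       # all IP addresses

    # A specific bind_ip narrows the answer (assumed return shape shown),
    # letting SAIO "servers" that share one host be told apart:
    my_ips = whataremyips('127.0.0.2')  # ['127.0.0.2']

    # With my_port=None, as the object-replicator and object-reconstructor
    # pass when servers_per_port is enabled, ports are ignored in the match:
    is_local_device(my_ips, None, '127.0.0.2', 6000)  # True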
For some (noisy) benchmarks that show servers_per_port=X is at least as good as the same number of "normal" workers: https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md

Benchmarks showing the benefits of I/O isolation with a small number of slow disks: https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md

If you were wondering what the overhead of threads_per_disk looks like: https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md

DocImpact
Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
[DEFAULT]
# bind_ip = 0.0.0.0
bind_port = 6000
# bind_timeout = 30
# backlog = 4096
# user = swift
# swift_dir = /etc/swift
# devices = /srv/node
# mount_check = true
# disable_fallocate = false
# expiring_objects_container_divisor = 86400
# expiring_objects_account_name = expiring_objects
#
# Use an integer to override the number of pre-forked processes that will
# accept connections. NOTE: if servers_per_port is set, this setting is
# ignored.
# workers = auto
#
# Make object-server run this many worker processes per unique port of
# "local" ring devices across all storage policies. This can help provide
# the isolation of threads_per_disk without the severe overhead. The default
# value of 0 disables this feature.
# servers_per_port = 0
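#
# Illustrative example (values below are not defaults): with a ring that
# gives each of this server's four disks its own port, e.g. devices added
# via "swift-ring-builder object.builder add r1z1-127.0.0.1:6001/sdb1 1"
# (and likewise 6002/sdc1, 6003/sdd1, 6004/sde1), setting
# "servers_per_port = 3" makes the parent fork three workers per port,
# twelve in all, each handling requests for a single disk only.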
#
# Maximum concurrent requests per worker
# max_clients = 1024
#
# You can specify default log routing here if you want:
# log_name = swift
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
# The following caps the length of log lines to the value given; no limit if
# set to 0, the default.
# log_max_line_length = 0
#
# Comma-separated list of functions to call to set up custom log handlers.
# Functions get passed: conf, name, log_to_console, log_route, fmt, logger,
# adapted_logger
# log_custom_handlers =
#
# If set, log_udp_host will override log_address
# log_udp_host =
# log_udp_port = 514
#
# You can enable StatsD logging here:
# log_statsd_host = localhost
# log_statsd_port = 8125
# log_statsd_default_sample_rate = 1.0
# log_statsd_sample_rate_factor = 1.0
# log_statsd_metric_prefix =
#
# eventlet_debug = false
#
# You can set fallocate_reserve to the number of bytes you'd like fallocate to
# reserve, whether there is space for the given file size or not.
# fallocate_reserve = 0
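#
# Illustrative example (value is not a default; semantics as described
# above): "fallocate_reserve = 1073741824" asks fallocate to keep 1 GiB
# of space in reserve on each disk.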
#
# Time to wait while attempting to connect to another backend node.
# conn_timeout = 0.5
# Time to wait while sending each chunk of data to another backend node.
# node_timeout = 3
# Time to wait while receiving each chunk of data from a client or another
# backend node.
# client_timeout = 60
#
# network_chunk_size = 65536
# disk_chunk_size = 65536

[pipeline:main]
pipeline = healthcheck recon object-server

[app:object-server]
use = egg:swift#object
# You can override the default log routing for this app here:
# set log_name = object-server
# set log_facility = LOG_LOCAL0
# set log_level = INFO
# set log_requests = true
# set log_address = /dev/log
#
# max_upload_time = 86400
# slow = 0
#
# Objects smaller than this are not evicted from the buffer cache once read
# keep_cache_size = 5242880
#
# If true, objects for authenticated GET requests may be kept in buffer cache
# if small enough
# keep_cache_private = false
#
# on PUTs, sync data every n MB
# mb_per_sync = 512
#
# Comma-separated list of headers that can be set in metadata on an object.
# This list is in addition to X-Object-Meta-* headers and cannot include
# Content-Type, etag, Content-Length, or deleted
# allowed_headers = Content-Disposition, Content-Encoding, X-Delete-At, X-Object-Manifest, X-Static-Large-Object
#
# auto_create_account_prefix = .
#
# A value of 0 means "don't use thread pools". A reasonable starting point is
# 4.
# threads_per_disk = 0
#
# Configure how this server handles replication requests:
# To handle all verbs, including replication verbs, do not specify
# "replication_server" (this is the default). To only handle replication,
# set to a True value (e.g. "True" or "1"). To handle only non-replication
# verbs, set to "False". Unless you have a separate replication network, you
# should not specify any value for "replication_server".
# replication_server = false
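#
# Illustrative sketch (not a default; assumes a separate replication
# network): a dedicated replication server would run from a second config
# whose [DEFAULT] section binds to the replication network IP and whose
# [app:object-server] section sets "replication_server = true", while the
# client-facing server's config sets "replication_server = false".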
#
# Set to restrict the number of concurrent incoming REPLICATION requests
# Set to 0 for unlimited
# Note that REPLICATION is currently an ssync only item
# replication_concurrency = 4
#
# Restricts incoming REPLICATION requests to one per device,
# replication_concurrency above allowing. This can help control I/O to each
# device, but you may wish to set this to False to allow multiple REPLICATION
# requests (up to the above replication_concurrency setting) per device.
# replication_one_per_device = True
#
# Number of seconds to wait for an existing replication device lock before
# giving up.
# replication_lock_timeout = 15
#
# These next two settings control when the REPLICATION subrequest handler will
# abort an incoming REPLICATION attempt. An abort will occur if there are at
# least threshold number of failures and the value of failures / successes
# exceeds the ratio. The defaults of 100 and 1.0 mean that at least 100
# failures have to occur and there have to be more failures than successes for
# an abort to occur.
# replication_failure_threshold = 100
# replication_failure_ratio = 1.0
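#
# Worked example with the defaults above: 100 failures against 120 successes
# meets the threshold (100 >= 100) but not the ratio (100 / 120 is about
# 0.83, not > 1.0), so the attempt continues; 150 failures against 100
# successes aborts (150 >= 100, and 150 / 100 = 1.5 > 1.0).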
#
# Use splice() for zero-copy object GETs. This requires Linux kernel
# version 3.0 or greater. If you set "splice = yes" but the kernel
# does not support it, error messages will appear in the object server
# logs at startup, but your object servers should continue to function.
#
# splice = no

[filter:healthcheck]
use = egg:swift#healthcheck
# An optional filesystem path, which if present, will cause the healthcheck
# URL to return "503 Service Unavailable" with a body of "DISABLED BY FILE"
# disable_path =

[filter:recon]
use = egg:swift#recon
#recon_cache_path = /var/cache/swift
#recon_lock_path = /var/lock

[object-replicator]
# You can override the default log routing for this app here (don't use set!):
# log_name = object-replicator
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
# vm_test_mode = no
# daemonize = on
#
# Time in seconds to wait between replication passes
# interval = 30
# run_pause is deprecated, use interval instead
# run_pause = 30
#
# concurrency = 1
# stats_interval = 300
#
# The sync method to use; default is rsync but you can use ssync to try the
# EXPERIMENTAL all-swift-code-no-rsync-callouts method. Once ssync is verified
# as having performance comparable to, or better than, rsync, we plan to
# deprecate rsync so we can move on with more features for replication.
# sync_method = rsync
#
# max duration of a partition rsync
# rsync_timeout = 900
#
# bandwidth limit for rsync in kB/s. 0 means unlimited
# rsync_bwlimit = 0
#
# passed to rsync for io op timeout
# rsync_io_timeout = 30
#
# Allow rsync to compress data which is transmitted to the destination node
# during sync. However, this is applicable only when the destination node is
# in a different region than the local one.
# NOTE: Objects that are already compressed (for example: .tar.gz, .mp3) might
# slow down the syncing process.
# rsync_compress = no
#
# node_timeout = <whatever's in the DEFAULT section or 10>
# max duration of an http request; this is for REPLICATE finalization calls and
# so should be longer than node_timeout
# http_timeout = 60
#
# attempts to kill all workers if nothing replicates for lockup_timeout seconds
# lockup_timeout = 1800
#
# The replicator also performs reclamation
# reclaim_age = 604800
#
# ring_check_interval = 15
# recon_cache_path = /var/cache/swift
#
# limits how long rsync error log lines are
# 0 means to log the entire line
# rsync_error_log_line_length = 0
#
# handoffs_first and handoff_delete are options for a special case
# such as disks filling up in the cluster. These two options SHOULD NOT BE
# CHANGED, except in such extreme situations (e.g. disks are full or about
# to fill up; in any case, DO NOT let your drives fill up).
# handoffs_first is the flag to replicate handoffs prior to canonical
# partitions. It allows you to force syncing and deleting handoffs quickly.
# If set to a True value (e.g. "True" or "1"), partitions
# that are not supposed to be on the node will be replicated first.
# handoffs_first = False
#
# handoff_delete is the number of replicas which are ensured in swift.
# If a number less than the number of replicas is set, the object-replicator
# may delete local handoffs even before all replicas are ensured in the
# cluster. The object-replicator will remove local handoff partition
# directories after syncing a partition when the number of successful
# responses is greater than or equal to this number. By default (auto),
# handoff partitions will only be removed when they have been successfully
# replicated to all the canonical nodes.
# handoff_delete = auto
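#
# Worked example: with 3 replicas and "handoff_delete = 2" (not a default),
# a local handoff partition is removed as soon as 2 of the 3 canonical
# nodes have synced successfully; with the default "auto", it is removed
# only after all 3 have.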

[object-reconstructor]
# You can override the default log routing for this app here (don't use set!):
# Unless otherwise noted, each setting below has the same meaning as described
# in the [object-replicator] section; however, these settings apply to the EC
# reconstructor
#
# log_name = object-reconstructor
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
# daemonize = on
#
# Time in seconds to wait between reconstruction passes
# interval = 30
# run_pause is deprecated, use interval instead
# run_pause = 30
#
# concurrency = 1
# stats_interval = 300
# node_timeout = 10
# http_timeout = 60
# lockup_timeout = 1800
# reclaim_age = 604800
# ring_check_interval = 15
# recon_cache_path = /var/cache/swift
# handoffs_first = False

[object-updater]
# You can override the default log routing for this app here (don't use set!):
# log_name = object-updater
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
# interval = 300
# concurrency = 1
# node_timeout = <whatever's in the DEFAULT section or 10>
# slowdown will sleep that amount between objects
# slowdown = 0.01
#
# recon_cache_path = /var/cache/swift

[object-auditor]
# You can override the default log routing for this app here (don't use set!):
# log_name = object-auditor
# log_facility = LOG_LOCAL0
# log_level = INFO
# log_address = /dev/log
#
# You can set the disk chunk size that the auditor uses, making it larger if
# you like for more efficient local auditing of larger objects
# disk_chunk_size = 65536
# files_per_second = 20
# concurrency = 1
# bytes_per_second = 10000000
# log_time = 3600
# zero_byte_files_per_second = 50
# recon_cache_path = /var/cache/swift

# Takes a comma-separated list of ints. If set, the object auditor will
# increment a counter for every object whose size is <= the given break
# points and report the result after a full scan.
# object_size_stats =
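#
# Illustrative example (values are not defaults): "object_size_stats =
# 1024, 10240, 102400" reports, after each full scan, how many audited
# objects were at most 1 KiB, at most 10 KiB, and at most 100 KiB.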

# Note: Put it at the beginning of the pipeline to profile all middleware. But
# it is safer to put this after healthcheck.
[filter:xprofile]
use = egg:swift#xprofile
# This option lets you switch profilers; the profiler should inherit from the
# standard Python profiler. Currently supported values include 'cProfile' and
# 'eventlet.green.profile'.
# profile_module = eventlet.green.profile
#
# This prefix will be used to combine process ID and timestamp to name the
# profile data file. Make sure the executing user has permission to write
# into this path (missing path segments will be created, if necessary).
# If you enable profiling in more than one type of daemon, you must override
# it with a unique value like: /var/log/swift/profile/object.profile
# log_filename_prefix = /tmp/log/swift/profile/default.profile
#
# the profile data will be dumped to local disk based on the above naming rule
# in this interval.
# dump_interval = 5.0
#
# Be careful: this option makes the profiler dump data into timestamped files,
# which can quickly pile up in the directory.
# dump_timestamp = false
#
# This is the path of the URL to access the mini web UI.
# path = /__profile__
#
# Clear the data when the wsgi server shuts down.
# flush_at_shutdown = false
#
# unwind the iterator of applications
# unwind = false