Currently, when the object-server serves a GET request and the DiskFile
reader iterates over disk file chunks, there is no explicit
eventlet sleep called. When the network outpaces the slow disk IO,
it's possible that one large and slow GET request could cause the
eventlet hub not to schedule any other green threads for a
long period of time. To improve this, this patch adds a
configurable sleep parameter to the DiskFile reader,
'cooperative_period', with a default value of 0 (disabled).
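A minimal sketch of the idea, assuming a hypothetical chunk iterator
(this is not Swift's actual DiskFile code):

    import eventlet

    def iter_chunks(fp, chunk_size=65536, cooperative_period=0):
        chunks_read = 0
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            yield chunk
            chunks_read += 1
            # every 'cooperative_period' chunks, yield to the eventlet hub
            # so other green threads get a chance to run
            if cooperative_period and chunks_read % cooperative_period == 0:
                eventlet.sleep()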
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I80b04bad0601b6cd6caef35498f89d4ba70a4fd4
Object GET requests with a truthy X-Newest header are not resumed if a
backend request times out. The GetOrHeadHandler therefore uses the
regular node_timeout when waiting for a backend connection response,
rather than the possibly shorter recoverable_node_timeout. However,
previously, while reading data from a backend response, the
recoverable_node_timeout would still be used for X-Newest requests.
This patch simplifies GetOrHeadHandler to never use
recoverable_node_timeout when X-Newest is truthy.
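Roughly, the timeout selection after this change looks like the
following sketch (function and argument names are illustrative, not
the real GetOrHeadHandler code):

    from swift.common.utils import config_true_value

    def pick_read_timeout(req, node_timeout, recoverable_node_timeout):
        if config_true_value(req.headers.get('X-Newest')):
            # X-Newest GETs are never resumed, so don't use the shorter timeout
            return node_timeout
        return recoverable_node_timeout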
Change-Id: I326278ecb21465f519b281c9f6c2dedbcbb5ff14
The allowed_digests option was added to the formpost middleware in
addition to the tempurl middleware[1], but the option was not added to
the formpost section of the example proxy config file.
[1] 2d063cd61f6915579840a41ac0248a26085e0245
Change-Id: Ic885e8bde7c1bbb3d93d032080b591db1de80970
Currently, SLO manifest files are evicted from the page cache
after being read, which makes hard drives very busy when a user
requests a lot of parallel byte-range GETs for a particular
SLO object.
This patch adds a new config option, 'keep_cache_slo_manifest', and
tries to keep the manifest files in the page cache by not evicting them
after reading, if the config settings allow it.
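For illustration, the eviction being skipped is essentially a
posix_fadvise(DONTNEED) call after the read; a simplified sketch,
not the actual DiskFile code:

    import os

    def drop_from_page_cache(fd, offset, length, keep_cache=False):
        if keep_cache:
            # leave the manifest pages cached for subsequent range GETs
            return
        # tell the kernel we won't need these pages again
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)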
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I557bd01643375d7ad68c3031430899b85908a54f
Clients sometimes hold open connections "just in case" they might later
pipeline requests. This can cause issues for proxies, especially if
operators restrict max_clients in an effort to improve response times
for the requests that *do* get serviced.
Add a new keepalive_timeout option to give proxies a way to drop these
established-but-idle connections without impacting active connections
(as may happen when reducing client_timeout). Note that this requires
eventlet 0.33.4 or later.
Change-Id: Ib5bb84fa3f8a4b9c062d58c8d3689e7030d9feb3
The internal client is supposed to be internal to the cluster, and as
such we rely on it not to remove any headers we decide to send. However,
if the allow_modify_pipeline option is set, the gatekeeper middleware is
added to the internal client's proxy pipeline.
So firstly, this patch removes the allow_modify_pipeline option from the
internal client constructor, and when calling loadapp
allow_modify_pipeline is always passed as False.
Further, an operator could directly put the gatekeeper middleware into the
internal client config. The internal client constructor will now check
the pipeline and raise a ValueError if one has been placed in the
pipeline.
To do this, there is now a check_gatekeeper_loaded staticmethod that
walks the pipeline and is called from the InternalClient.__init__ method.
To enable this walk through the pipeline, the wsgi pipeline is now stashed
in each filter so that we don't have to rely on 'app' naming
conventions to iterate the pipeline.
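A rough sketch of that check (the '_pipeline' attribute name here is an
assumption for illustration, not necessarily what the patch stashes):

    from swift.common.middleware.gatekeeper import GatekeeperMiddleware

    def check_gatekeeper_loaded(app):
        # walk the stashed wsgi pipeline rather than following 'app'
        # attribute naming conventions
        for filter_ in getattr(app, '_pipeline', []):
            if isinstance(filter_, GatekeeperMiddleware):
                raise ValueError('Gatekeeper middleware is not allowed in '
                                 'the InternalClient proxy pipeline')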
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Idcca7ac0796935c8883de9084d612d64159d9f92
Reseller admins can set new headers on accounts like
X-Account-Quota-Bytes-Policy-<policy-name>: <quota>
This may be done to limit consumption of a faster, all-flash policy, for
example.
This is independent of the existing X-Account-Meta-Quota-Bytes header, which
continues to limit the total storage for an account across all policies.
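For example, a reseller admin might cap a 'gold' policy on an account
like this (hedged sketch using plain HTTP; the policy name, endpoint
and numbers are made up):

    import requests

    reseller_admin_token = '<reseller admin token>'
    requests.post(
        'https://swift.example.com/v1/AUTH_test',
        headers={
            'X-Auth-Token': reseller_admin_token,
            # ~10 GiB cap for objects stored in the 'gold' policy
            'X-Account-Quota-Bytes-Policy-gold': '10737418240',
        },
    )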
Change-Id: Ib25c2f667e5b81301f8c67375644981a13487cfe
This patch adds a configurable timeout after which the sharder
will warn if a container DB has not completed sharding.
The new config option is container_sharding_timeout, with a default of
172800 seconds (2 days).
Drive-by fix: recording of sharding progress now also covers the case
of shard range shrinking.
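Roughly, the new warning amounts to the following check (names and
structure are illustrative only):

    import time

    def warn_if_sharding_overdue(logger, sharding_started_at,
                                 container_sharding_timeout=172800):
        elapsed = time.time() - sharding_started_at
        if elapsed > container_sharding_timeout:
            logger.warning('Container has not completed sharding after '
                           '%.0f seconds', elapsed)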
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I6ce299b5232a8f394e35f148317f9e08208a0c0f
If you've got thousands of requests per second for objects in a single
container, you basically NEVER want that container's info to ever fall
out of memcache. If it *does*, all those clients are almost certainly
going to overload the container.
Avoid this by allowing some small fraction of requests to bypass and
refresh the cache, pushing out the TTL as long as there continue to be
requests to the container. The likelihood of skipping the cache is
configurable, similar to what we did for shard range sets.
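A minimal sketch of the skip decision, with an assumed option name for
illustration:

    import random

    def should_skip_cache(container_info_skip_chance=0.001):
        # a small fraction of requests bypass memcache and go to the
        # container server, re-populating the cached info and its TTL
        return random.random() < container_info_skip_chance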
Change-Id: If9249a42b30e2a2e7c4b0b91f947f24bf891b86f
Closes-Bug: #1883324
Currently the object-replicator has an option called `handoff_delete`
which allows us to define the number of replicas which are ensured
in swift. Once a handoff node receives that many successful responses it
can go ahead and delete the handoff partition.
By default it's 'auto', or rather the number of primary nodes. But this
can be reduced. It's useful for draining full disks, but has to be used
carefully.
This patch adds the same option to the DB replicator, where it works the
same way, except that instead of deleting a partition the deletion is
done at the per-DB level. Because it's implemented at the DB replicator
level, the option is now available to both the Account and Container
replicators.
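Illustratively, the cleanup decision for a handoff DB looks like this
sketch (not the actual replicator code):

    def can_delete_handoff(successes, replica_count, handoff_delete=None):
        # handoff_delete defaults to 'auto' (None here), i.e. every
        # primary must have acknowledged the replication
        required = replica_count if handoff_delete is None else handoff_delete
        return successes >= required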
Change-Id: Ide739a6d805bda20071c7977f5083574a5345a33
Add a new ring_ip option that services will use when finding their own
devices in rings, defaulting to the bind_ip.
Notably, this allows services to be containerized while servers_per_port
is enabled:
* For the object-server, the ring_ip should be set to the host ip and
will be used to discover which ports need binding. Sockets will still
be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
host will publish ports 1:1.
* For the replicator and reconstructor, the ring_ip will be used to
discover which devices should be replicated. While bind_ip could
previously be used for this, it would have required a separate config
from the object-server.
Also rename the object daemon's bind_ip attribute to ring_ip so that it's
more obvious wherever we're using the IP for ring lookups instead of
socket binding.
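Conceptually, the address used for ring lookups is chosen like this
(sketch only; the conf dict and fallback chain are assumptions):

    def get_ring_ip(conf):
        # fall back to bind_ip so existing configurations keep working
        return conf.get('ring_ip', conf.get('bind_ip', '0.0.0.0'))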
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
This is a fairly blunt tool: ratelimiting is per device and
applied independently in each worker, but this at least provides
some limit to disk IO on backend servers.
GET, HEAD, PUT, POST, DELETE, UPDATE and REPLICATE methods may be
rate-limited.
Only requests with a path starting '<device>/<partition>', where
<partition> can be cast to an integer, will be rate-limited. Other
requests, including, for example, recon requests with paths such as
'recon/version', are unconditionally forwarded to the next app in the
pipeline.
OPTIONS and SSYNC methods are not rate-limited. Note that
SSYNC sub-requests are passed directly to the object server app
and will not pass through this middleware.
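A rough sketch of the path filter described above (not the middleware
itself; the function name is made up):

    def is_ratelimitable(path):
        segments = path.lstrip('/').split('/', 2)
        if len(segments) < 2:
            return False
        try:
            int(segments[1])  # the <partition> component
        except ValueError:
            # e.g. 'recon/version' -- forwarded to the next app unconditionally
            return False
        return True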
Change-Id: I78b59a081698a6bff0d74cbac7525e28f7b5d7c1
The sample rsyslog.conf file doesn't include some container services
and object services. This change adds these services so that all daemon
services are listed.
Change-Id: Ica45b86d5b4da4e3ffc334c86bd383bebe7e7d5d
- Drop log level for successful rsyncs to debug; ops don't usually care.
- Add an option to skip "send" lines entirely -- in a large cluster,
during a meaningful expansion, there's too much information getting
logged; it's just wasting disk space.
Note that we already have similar filtering for directory creation;
that's been present since the initial commit of Swift code.
Drive-by: make it a little more clear that more than one suffix was
likely replicated when logging about success.
Change-Id: I02ba67e77e3378b2c2c8c682d5d230d31cd1bfa9
We said this would be going away back in 1.7.0 -- let's actually remove it.
Change-Id: I9742dd907abea86da9259740d913924bb1ce73e7
Related-Change: Id7d6d547b103b4f23ebf5be98b88f09ec6027ce4
Previously, objects updates that could not be sent immediately due to
per-container/bucket ratelimiting [1] would be skipped and re-tried
during the next updater cycle. There could potentially be a period of
time at the end of a cycle when the updater slept, having completed a
sweep of the on-disk async pending files, despite having skipped
updates during the cycle. Skipped updates would then be read from disk
again during the next cycle.
With this change the updater will defer skipped updates to an
in-memory queue (up to a configurable maximum number) until the sweep
of async pending files has completed, and then trickle out deferred
updates until the cycle's interval expires. This increases the useful
work done in the current cycle and reduces the amount of repeated disk
IO during the next cycle.
The deferrals queue is bounded in size and will evict least recently
read updates in order to accept more recently read updates. This
reduces the probability that a deferred update has been made obsolete
by newer on-disk async pending files while waiting in the deferrals
queue.
The deferrals queue is implemented as a collection of per-bucket
queues so that updates can be drained from the queues in the order
that buckets cease to be ratelimited.
[1] Related-Change: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
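For illustration only, a toy sketch of a bounded deferral queue with
oldest-first eviction (the real updater keeps per-bucket queues; the
names here are made up):

    from collections import deque

    class DeferralQueue(object):
        def __init__(self, max_deferred_updates=10000):
            self.max_deferred_updates = max_deferred_updates
            self.queue = deque()

        def defer(self, update):
            if len(self.queue) >= self.max_deferred_updates:
                # evict the least recently read update to make room
                self.queue.popleft()
            self.queue.append(update)

        def drain(self):
            while self.queue:
                # most recently read updates are retried first
                yield self.queue.pop()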
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I95e58df9f15c5f9d552b8f4c4989a474f52262f4
Whenever an item larger than item_size_warning_threshold is set,
a warning is logged in the form:
'Item size larger than warning threshold: 2048576 (2Mi) >= 1000000 (977Ki)'
Setting the value to -1 (default) will turn off the warning.
Change-Id: I1fb50844d6b9571efaab8ac67705b2fc1fe93e25
Several headers and query params were previously revealed in logs but
are now redacted:
* X-Auth-Token header (previously redacted in the {auth_token} field,
but not the {headers} field)
* temp_url_sig query param (used by tempurl middleware)
* Authorization header and X-Amz-Signature and Signature query
parameters (used by s3api middleware)
This patch adds some new middleware helper methods to track headers and
query parameters that should be redacted by proxy-logging. While
instantiating the middleware, authors can call either:
register_sensitive_header('case-insensitive-header-name')
register_sensitive_param('case-sensitive-query-param-name')
to add items that should be redacted. The redaction uses proxy-logging's
existing reveal_sensitive_prefix config option to determine how much to
reveal.
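For example, a middleware might register its own sensitive items at
instantiation time (the import path shown is an assumption, and the
middleware, header and param names are made up):

    from swift.common.registry import register_sensitive_header, \
        register_sensitive_param

    class MyAuthMiddleware(object):
        def __init__(self, app, conf):
            self.app = app
            # ask proxy-logging to redact these in {headers} and query strings
            register_sensitive_header('x-my-auth-token')
            register_sensitive_param('my_sig')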
Note that query params will still be logged in their entirety if
eventlet_debug is enabled.
UpgradeImpact
=============
The reveal_sensitive_prefix config option now applies to more items;
operators should review their currently-configured value to ensure it
is appropriate for these new contexts. In particular, operators should
consider reducing the value if it is more than 20 or so, even if that
previously offered sufficient protection for auth tokens.
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Closes-Bug: #1685798
Change-Id: I88b8cfd30292325e0870029058da6fb38026ae1a
By having some small portion of calls skip cache and go straight to
disk, we can ensure the cache is always kept fresh and never expires (at
least, for active containers). Previously, when shard ranges fell out of
cache there would frequently be a thundering herd that could overwhelm
the container server, leading to 503s served to clients or an increase
in async pendings.
Include metrics for hit/miss/skip rates.
Change-Id: I6d74719fb41665f787375a08184c1969c86ce2cf
Related-Bug: #1883324
Sometimes a cluster might be accessible via more than one set
of domain names. Allow operators to configure them such that
virtual-host style requests work with all names.
Change-Id: I83b2fded44000bf04f558e2deb6553565d54fd4a
Modify the 'log_name' option in the InternalClient wsgi config for the
following services: container-sharder, container-reconciler,
container-deleter, container-sync and object-expirer.
Previously the 'log_name' value for all internal client instances
sharing a single internal-client.conf file took the value configured
in the conf file, or would default to 'swift'. This resulted in no
distinction between logs from each internal client, and no association
with the service using a particular internal client.
With this change the 'log_name' value will typically be <log_route>-ic
where <log_route> is the service's conf file section name. For
example, 'container-sharder-ic'.
Note: any 'log_name' value configured in an internal client conf file
will now be ignored for these services unless the option key is
preceded by 'set'.
Note: by default, the logger's StatsdClient uses the log_name as its
tail_prefix when composing metrics' names. However, the proxy-logging
middleware overrides the tail_prefix with the hard-coded value
'proxy-server'. This change to log_name therefore does not change the
statsd metric names emitted by the internal client's proxy-logging.
This patch does not change the logging of the services themselves,
just their internal clients.
Change-Id: I844381fb9e1f3462043d27eb93e3fa188b206d05
Related-Change: Ida39ec7eb02a93cf4b2aa68fc07b7f0ae27b5439
Throw our stream of async_pendings through a hash ring; if the virtual
bucket gets hot just start leaving the updates on the floor and move on.
It's off by default; and if you use it you're probably going to leave a
bunch of async updates pointed at a small set of containers in the queue
for the next sweep every sweep (so maybe turn it off at some point).
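Very roughly, the virtual bucketing works like this (toy sketch,
made-up names):

    import hashlib

    def bucket_for(account, container, num_buckets=1000):
        key = ('%s/%s' % (account, container)).encode('utf8')
        # updates hash into a fixed set of virtual buckets; only buckets
        # that get "hot" have their updates dropped for this sweep
        return int(hashlib.md5(key).hexdigest(), 16) % num_buckets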
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
We've had this option for a year now, and it seems to help. Let's enable
it for everyone. Note that Swift clients still need to opt into the
async delete via a query param, while S3 clients get it for free.
Change-Id: Ib4164f877908b855ce354cc722d9cb0be8be9921
Previously the ssync Sender would attempt to revert all objects in a
partition within a single SSYNC request. With this change the
reconstructor daemon option max_objects_per_revert can be used to limit
the number of objects reverted inside a single SSYNC request for revert
type jobs i.e. when reverting handoff partitions.
If more than max_objects_per_revert are available, the remaining objects
will remain in the sender partition and will not be reverted until the
next call to ssync.Sender, which would currently be the next time the
reconstructor visits that handoff partition.
Note that the option only applies to handoff revert jobs, not to sync
jobs.
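A toy illustration of capping how many objects a single revert job
sends (not the actual ssync code):

    import itertools

    def objects_to_revert(object_iter, max_objects_per_revert=0):
        if max_objects_per_revert > 0:
            return itertools.islice(object_iter, max_objects_per_revert)
        return object_iter  # 0 (the default) means no limit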
Change-Id: If81760c80a4692212e3774e73af5ce37c02e8aff
This patch continues work on more of the "Consistent and
Secure Default Policies" effort. We already have system scope
personas implemented, but the architecture people are asking
for project scope now. At least we don't need domain scope.
Change-Id: If7d39ac0dfbe991d835b76eb79ae978fc2fd3520