Deployment Guide
================

This document provides general guidance for deploying and configuring Swift.
Detailed descriptions of configuration options can be found in the
:doc:`configuration documentation <config/index>`.

-----------------------
Hardware Considerations
-----------------------

Swift is designed to run on commodity hardware. RAID on the storage drives is
neither required nor recommended. Swift's disk usage pattern is the worst case
possible for RAID, and performance degrades very quickly with RAID 5 or 6.

------------------
Deployment Options
------------------

The Swift services run completely autonomously, which provides for a lot of
flexibility when architecting the hardware deployment for Swift. The 4 main
services are:

#. Proxy Services
#. Object Services
#. Container Services
#. Account Services

The Proxy Services are more CPU and network I/O intensive. If you are using
10g networking to the proxy, or are terminating SSL traffic at the proxy,
greater CPU power will be required.

The Object, Container, and Account Services (Storage Services) are more disk
and network I/O intensive.

The easiest deployment is to install all services on each server. There is
nothing wrong with doing this, as it scales each service out horizontally.

Alternatively, one set of servers may be dedicated to the Proxy Services and a
different set of servers dedicated to the Storage Services. This allows faster
networking to be configured to the proxy than the storage servers, and keeps
load balancing to the proxies more manageable. Storage Services scale out
horizontally as storage servers are added, and the overall API throughput can
be scaled by adding more proxies.

If you need more throughput to either Account or Container Services, they may
each be deployed to their own servers. For example, you might use faster (but
more expensive) SAS or even SSD drives to get faster disk I/O to the databases.

A high-availability (HA) deployment of Swift requires that multiple proxy
servers are deployed and requests are load-balanced between them. Each proxy
server instance is stateless and able to respond to requests for the entire
cluster.

Load balancing and network design are left as an exercise to the reader, but
they are a very important part of the cluster, so time should be spent
designing the network for a Swift cluster.

---------------------
Web Front End Options
---------------------

Swift comes with an integral web front end. However, it can also be deployed
as a request processor for Apache2 using mod_wsgi, as described in the
:doc:`Apache Deployment Guide <apache_deployment_guide>`.

.. _ring-preparing:

------------------
Preparing the Ring
------------------

The first step is to determine the number of partitions that will be in the
ring. We recommend that there be a minimum of 100 partitions per drive to
ensure even distribution across the drives. A good starting point might be
to figure out the maximum number of drives the cluster will contain, multiply
that by 100, and then round up to the nearest power of two.

For example, imagine we are building a cluster that will have no more than
5,000 drives. That would mean that we would have a total number of 500,000
partitions, which is pretty close to 2^19, rounded up.
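
Spelling that arithmetic out (``swift-ring-builder`` takes the exponent, the
part power, rather than the raw partition count)::

    5,000 drives * 100 partitions per drive = 500,000 partitions
    2^18 = 262,144  <  500,000  <=  524,288 = 2^19
    => use a part power of 19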

It is also a good idea to keep the number of partitions relatively small. The
more partitions there are, the more work that has to be done by the
replicators and other backend jobs, and the more memory the rings consume in
process. The goal is to find a good balance between small rings and maximum
cluster size.

The next step is to determine the number of replicas of the data to store.
Currently it is recommended to use 3 (as this is the only value that has
been tested). The higher the number, the more storage that is used, but the
less likely you are to lose data.

It is also important to determine how many zones the cluster should have. It is
recommended to start with a minimum of 5 zones. You can start with fewer, but
our testing has shown that having at least five zones is optimal when failures
occur. We also recommend trying to configure the zones at as high a level as
possible to create as much isolation as possible. Things to take into
consideration include physical location, power availability, and network
connectivity. For example, in a small cluster you might decide to split the
zones up by cabinet, with each cabinet having its own power and network
connectivity. The zone concept is very abstract, so feel free to use it in
whatever way best isolates your data from failure. Each zone exists in a
region.

A region is also an abstract concept that may be used to distinguish between
geographically separated areas, or between groups of servers within the same
datacenter. Regions and zones are referenced by a positive integer.

You can now start building the ring with::

    swift-ring-builder <builder_file> create <part_power> <replicas> <min_part_hours>

This will start the ring build process, creating the <builder_file> with
2^<part_power> partitions. <min_part_hours> is the time in hours before a
specific partition can be moved in succession (24 is a good value for this).
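
For instance, creating a builder for an object ring with the part power from
the example above, three replicas, and 24 hour ``min_part_hours`` might look
like this (the file name ``object.builder`` is just an illustrative choice)::

    swift-ring-builder object.builder create 19 3 24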

Devices can be added to the ring with::

    swift-ring-builder <builder_file> add r<region>z<zone>-<ip>:<port>/<device_name>_<meta> <weight>

This will add a device to the ring, where <builder_file> is the name of the
builder file that was created previously, <region> is the number of the region
the zone is in, <zone> is the number of the zone this device is in, <ip> is
the IP address of the server the device is on, <port> is the port number that
the server is running on, <device_name> is the name of the device on the server
(for example: sdb1), <meta> is an optional string of metadata for the device,
and <weight> is a float weight that determines how many partitions are put on
the device relative to the rest of the devices in the cluster (a good starting
point is 100.0 x TB on the drive). Add each device that will initially be in
the cluster.
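
As a concrete illustration (the IP address, port, and device are hypothetical),
adding a 4 TB drive named ``sdb1`` in region 1, zone 1, with a weight of
100.0 x 4 TB = 400.0, might look like::

    swift-ring-builder object.builder add r1z1-10.0.0.1:6200/sdb1 400.0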

Once all of the devices are added to the ring, run::

    swift-ring-builder <builder_file> rebalance

This will distribute the partitions across the drives in the ring. It is
important, whenever making changes to the ring, to make all the changes
required before running rebalance. This will ensure that the ring stays as
balanced as possible and that as few partitions as possible are moved.

The above process should be done to make a ring for each storage service
(Account, Container and Object). The builder files will be needed in future
changes to the ring, so it is very important that these be kept and backed up.
The resulting .ring.gz ring files should be pushed to all of the servers in
the cluster. For more information about building rings, running
swift-ring-builder with no options will display help text with available
commands and options. More information on how the ring works internally
can be found in the :doc:`Ring Overview <overview_ring>`.
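
A minimal end-to-end sketch for all three rings (file names, the example host
name, and the use of ``scp`` are illustrative; ring files normally live in
``/etc/swift`` on every node) might look like::

    swift-ring-builder account.builder create 19 3 24
    swift-ring-builder container.builder create 19 3 24
    swift-ring-builder object.builder create 19 3 24
    # ... add every device to each of the three builders ...
    swift-ring-builder account.builder rebalance
    swift-ring-builder container.builder rebalance
    swift-ring-builder object.builder rebalance
    scp *.ring.gz storage-node-1:/etc/swift/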

.. _server-per-port-configuration:

-------------------------------
Running object-servers Per Disk
-------------------------------

The lack of true asynchronous file I/O on Linux leaves the object-server
workers vulnerable to misbehaving disks. Because any object-server worker can
service a request for any disk, and a slow I/O request blocks the eventlet hub,
a single slow disk can impair an entire storage node. This also prevents
object servers from fully utilizing all their disks during heavy load.

One way to get full I/O isolation is to give each disk on a storage node a
different port in the storage policy rings, and then to set the
:ref:`servers_per_port <object-server-default-options>`
option in the object-server config. Note that while the purpose of this config
setting is to run one or more object-server worker processes per *disk*, the
implementation just runs object-servers per unique port of local devices in the
rings. The deployer must combine this option with appropriately-configured
rings to benefit from this feature.

Here's an example (abbreviated) old-style ring (2 node cluster with 2 disks
each)::

    Devices:   id  region  zone  ip address  port  replication ip  replication port  name
                0       1     1     1.1.0.1  6200         1.1.0.1              6200    d1
                1       1     1     1.1.0.1  6200         1.1.0.1              6200    d2
                2       1     2     1.1.0.2  6200         1.1.0.2              6200    d3
                3       1     2     1.1.0.2  6200         1.1.0.2              6200    d4

And here's the same ring set up for ``servers_per_port``::

    Devices:   id  region  zone  ip address  port  replication ip  replication port  name
                0       1     1     1.1.0.1  6200         1.1.0.1              6200    d1
                1       1     1     1.1.0.1  6201         1.1.0.1              6201    d2
                2       1     2     1.1.0.2  6200         1.1.0.2              6200    d3
                3       1     2     1.1.0.2  6201         1.1.0.2              6201    d4

When migrating from normal to ``servers_per_port``, perform these steps in order:

#. Upgrade Swift code to a version capable of doing ``servers_per_port``.

#. Enable ``servers_per_port`` with a value greater than zero.

#. Restart ``swift-object-server`` processes with a SIGHUP. At this point, you
   will have the ``servers_per_port`` number of ``swift-object-server``
   processes serving all requests for all disks on each node. This preserves
   availability, but you should perform the next step as quickly as possible.

#. Push out new rings that actually have different ports per disk on each
   server. One of the ports in the new ring should be the same as the port
   used in the old ring ("6200" in the example above). This will cover
   existing proxy-server processes that haven't loaded the new ring yet. They
   can still talk to any storage node regardless of whether or not that
   storage node has loaded the ring and started object-server processes on
   the new ports.

If you do not run a separate object-server for replication, then this setting
must be available to the object-replicator and object-reconstructor (i.e.
appear in the [DEFAULT] config section).
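
As a minimal sketch (the worker count of 4 is only an example; pick a value
suited to your hardware), the object-server config for the
``servers_per_port`` ring above might contain::

    [DEFAULT]
    # Bind to every local address; the per-disk ports come from the ring.
    bind_ip = 0.0.0.0
    # Run this many object-server workers for each unique local port
    # found in the rings (6200 and 6201 in the example above).
    servers_per_port = 4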

.. _general-service-configuration:

-----------------------------
General Service Configuration
-----------------------------

Most Swift services fall into two categories: Swift's WSGI servers and
background daemons.

For more information specific to the configuration of Swift's WSGI servers
with paste deploy see :ref:`general-server-configuration`.
|
2013-03-25 16:34:43 -07:00
|
|
|
|
|
|
|
Configuration for servers and daemons can be expressed together in the same
|
|
|
|
file for each type of server, or separately. If a required section for the
|
|
|
|
service trying to start is missing there will be an error. The sections not
|
|
|
|
used by the service are ignored.
|
|
|
|
Consider the example of an object storage node. By convention, configuration
for the object-server, object-updater, object-replicator, object-auditor, and
object-reconstructor exists in a single file ``/etc/swift/object-server.conf``::

    [DEFAULT]
    reclaim_age = 604800

    [pipeline:main]
    pipeline = object-server

    [app:object-server]
    use = egg:swift#object

    [object-replicator]

    [object-updater]

    [object-auditor]

Swift services expect a configuration path as the first argument::

    $ swift-object-auditor
    Usage: swift-object-auditor CONFIG [options]

    Error: missing config path argument

If you omit the object-auditor section this file cannot be used as the
configuration path when starting the ``swift-object-auditor`` daemon::

    $ swift-object-auditor /etc/swift/object-server.conf
    Unable to find object-auditor config section in /etc/swift/object-server.conf

If the configuration path is a directory instead of a file all of the files in
the directory with the file extension ".conf" will be combined to generate the
configuration object which is delivered to the Swift service. This is
referred to generally as "directory based configuration".

Directory based configuration leverages ConfigParser's native multi-file
support. Files ending in ".conf" in the given directory are parsed in
lexicographical order. Filenames starting with '.' are ignored. A mixture of
file and directory configuration paths is not supported - if the configuration
path is a file only that file will be parsed.

The Swift service management tool ``swift-init`` has adopted the convention of
looking for ``/etc/swift/{type}-server.conf.d/`` if the file
``/etc/swift/{type}-server.conf`` does not exist.

When using directory based configuration, if the same option under the same
section appears more than once in different files, the last value parsed is
said to override previous occurrences. You can ensure proper override
precedence by prefixing the files in the configuration directory with
numerical values::

    /etc/swift/
        default.base
        object-server.conf.d/
            000_default.conf -> ../default.base
            001_default-override.conf
            010_server.conf
            020_replicator.conf
            030_updater.conf
            040_auditor.conf

You can inspect the resulting combined configuration object using the
``swift-config`` command line tool.

.. _general-server-configuration:

----------------------------
General Server Configuration
----------------------------

Swift uses paste.deploy (https://pypi.org/project/Paste/) to manage server
configurations. Detailed descriptions of configuration options can be found in
the :doc:`configuration documentation <config/index>`.

Default configuration options are set in the ``[DEFAULT]`` section, and any
options specified there can be overridden in any of the other sections BUT
ONLY BY USING THE SYNTAX ``set option_name = value``. This is the unfortunate
way paste.deploy works and I'll try to explain it in full.

First, here's an example paste.deploy configuration file::

    [DEFAULT]
    name1 = globalvalue
    name2 = globalvalue
    name3 = globalvalue
    set name4 = globalvalue

    [pipeline:main]
    pipeline = myapp

    [app:myapp]
    use = egg:mypkg#myapp
    name2 = localvalue
    set name3 = localvalue
    set name5 = localvalue
    name6 = localvalue

The resulting configuration that myapp receives is::

    global {'__file__': '/etc/mypkg/wsgi.conf', 'here': '/etc/mypkg',
     'name1': 'globalvalue',
     'name2': 'globalvalue',
     'name3': 'localvalue',
     'name4': 'globalvalue',
     'name5': 'localvalue',
     'set name4': 'globalvalue'}
    local {'name6': 'localvalue'}

So, ``name1`` got the global value which is fine since it's only in the ``DEFAULT``
section anyway.

``name2`` got the global value from ``DEFAULT`` even though it appears to be
overridden in the ``app:myapp`` subsection. This is just the unfortunate way
paste.deploy works (at least at the time of this writing).

``name3`` got the local value from the ``app:myapp`` subsection because it is using
the special paste.deploy syntax of ``set option_name = value``. So, if you want
a default value for most app/filters but want to override it in one
subsection, this is how you do it.

``name4`` got the global value from ``DEFAULT`` since it's only in that section
anyway. But, since we used the ``set`` syntax in the ``DEFAULT`` section even
though we shouldn't, notice we also got a ``set name4`` variable. Weird, but
probably not harmful.

``name5`` got the local value from the ``app:myapp`` subsection since it's only
there anyway, but notice that it is in the global configuration and not the
local configuration. This is because we used the ``set`` syntax to set the
value. Again, weird, but not harmful since Swift just treats the two sets of
configuration values as one set anyway.

``name6`` got the local value from the ``app:myapp`` subsection since it's only
there, and since we didn't use the ``set`` syntax, it's only in the local
configuration and not the global one. Though, as indicated above, there is no
special distinction with Swift.

That's quite an explanation for something that should be so much simpler, but
it might be important to know how paste.deploy interprets configuration files.
The main rule to remember when working with Swift configuration files is:

.. note::

    Use the ``set option_name = value`` syntax in subsections if the option is
    also set in the ``[DEFAULT]`` section. Don't get in the habit of always
    using the ``set`` syntax or you'll probably mess up your non-paste.deploy
    configuration files.

.. _proxy_server_per_policy_config:

************************
Per policy configuration
************************

Some proxy-server configuration options may be overridden for individual
:doc:`overview_policies` by including per-policy config section(s). These
options are:

- ``sorting_method``
- ``read_affinity``
- ``write_affinity``
- ``write_affinity_node_count``
- ``write_affinity_handoff_delete_count``

The per-policy config section name must be of the form::

    [proxy-server:policy:<policy index>]

.. note::

    The per-policy config section name should refer to the policy index, not
    the policy name.

.. note::

    The first part of the per-policy config section name must match the name
    of the proxy-server config section. This is typically ``proxy-server`` as
    shown above, but if different then the names of any per-policy config
    sections must be changed accordingly.

The value of an option specified in a per-policy section will override any
value given in the proxy-server section for that policy only. Otherwise the
value of these options will be that specified in the proxy-server section.

For example, the following section provides policy-specific options for a
policy with index ``3``::

    [proxy-server:policy:3]
    sorting_method = affinity
    read_affinity = r2=1
    write_affinity = r2
    write_affinity_node_count = 1 * replicas
    write_affinity_handoff_delete_count = 2

.. note::

    It is recommended that per-policy config options are *not* included in the
    ``[DEFAULT]`` section. If they are then the following behavior applies.

    Per-policy config sections will inherit options in the ``[DEFAULT]``
    section of the config file, and any such inheritance will take precedence
    over inheriting options from the proxy-server config section.

    Per-policy config section options will override options in the
    ``[DEFAULT]`` section. Unlike the behavior described under `General Server
    Configuration`_ for paste-deploy ``filter`` and ``app`` sections, the
    ``set`` keyword is not required for options to override in per-policy
    config sections.

For example, given the following settings in a config file::

    [DEFAULT]
    sorting_method = affinity
    read_affinity = r0=100
    write_affinity = r0

    [app:proxy-server]
    use = egg:swift#proxy
    # use of set keyword here overrides [DEFAULT] option
    set read_affinity = r1=100
    # without set keyword, [DEFAULT] option overrides in a paste-deploy section
    write_affinity = r1

    [proxy-server:policy:0]
    sorting_method = affinity
    # set keyword not required here to override [DEFAULT] option
    write_affinity = r1

would result in policy with index ``0`` having settings:

* ``read_affinity = r0=100`` (inherited from the ``[DEFAULT]`` section)
* ``write_affinity = r1`` (specified in the policy 0 section)

and any other policy would have the default settings of:

* ``read_affinity = r1=100`` (set in the proxy-server section)
* ``write_affinity = r0`` (inherited from the ``[DEFAULT]`` section)

*****************
Proxy Middlewares
*****************

Many features in Swift are implemented as middleware in the proxy-server
pipeline. See :doc:`middleware` and the ``proxy-server.conf-sample`` file for
more information. In particular, the use of some type of :doc:`authentication
and authorization middleware <overview_auth>` is highly recommended.

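As an illustrative sketch only (the middleware names are real, but this is
not a complete or recommended production pipeline; see
``proxy-server.conf-sample`` for the canonical ordering), a simple pipeline
with caching and auth might look like::

    [pipeline:main]
    pipeline = catch_errors healthcheck proxy-logging cache tempauth proxy-logging proxy-server

    [filter:tempauth]
    use = egg:swift#tempauth

    [app:proxy-server]
    use = egg:swift#proxy

Each middleware named in the pipeline needs a corresponding ``[filter:...]``
section; only two sections are shown here for brevity.
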
------------------------
Memcached Considerations
------------------------

Several of the services rely on Memcached for caching certain types of
lookups, such as auth tokens and container/account existence. Swift does not
do any caching of actual object data. Memcached should be able to run on any
servers that have available RAM and CPU. Typically Memcached is run on the
proxy servers. The ``memcache_servers`` config option in ``proxy-server.conf``
should contain all memcached servers.

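For example (the addresses below are placeholders), the cache middleware
section of ``proxy-server.conf`` might list every memcached instance in the
cluster::

    [filter:cache]
    use = egg:swift#memcache
    memcache_servers = 10.0.0.1:11211,10.0.0.2:11211,10.0.0.3:11211
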
*************************
Shard Range Listing Cache
*************************

When a container gets :ref:`sharded<sharding_doc>` the root container will
still be the primary entry point to many container requests, as it provides
the list of shards. To take load off the root container, Swift by default
caches the list of shards returned.

As the number of shards for a root container grows to more than 3k, the
memcache default max size of 1MB can be reached.

If you over-run your max configured memcache size you'll see messages like::

    Error setting value in memcached: 127.0.0.1:11211: SERVER_ERROR object too large for cache

When you see these messages your root containers are getting hammered and
probably returning 503 responses to clients. Override the default 1MB limit to
5MB with something like::

    /usr/bin/memcached -I 5000000 ...

Memcache has a ``stats sizes`` option that can point out the current size
usage. As this reaches the current max an increase might be in order::

    # telnet <memcache server> 11211
    > stats sizes
    STAT 160 2
    STAT 448 1
    STAT 576 1
    END

-----------
System Time
-----------

Time may be relative but it is relatively important for Swift! Swift uses
timestamps to determine which is the most recent version of an object.
It is very important for the system time on each server in the cluster to
be synced as closely as possible (more so for the proxy server, but in general
it is a good idea for all the servers). Typical deployments use NTP with a
local NTP server to ensure that the system times are as close as possible.
This should also be monitored to ensure that the times do not vary too much.

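For example, on hosts running ntpd a quick spot check of peer offsets can be
done with::

    ntpq -p

The exact tooling will vary; chrony-based deployments have equivalent
commands.
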
.. _general-service-tuning:

----------------------
General Service Tuning
----------------------

Most services support either a ``workers`` or ``concurrency`` value in the
settings. This allows the services to make effective use of the cores
available. A good starting point is to set the concurrency level for the proxy
and storage services to 2 times the number of cores available. If more than
one service is sharing a server, then some experimentation may be needed to
find the best balance.

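As a sketch (the value is only an example, following the 2 x cores guidance
above for an 8 core machine), the setting lives in the ``[DEFAULT]`` section
of each server config::

    [DEFAULT]
    workers = 16
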
For example, one operator reported using the following settings in a production
Swift cluster:

- Proxy servers have dual quad core processors (i.e. 8 cores); testing has
  shown 16 workers to be a pretty good balance when saturating a 10g network
  and gives good CPU utilization.

- Storage server processes all run together on the same servers. These servers
  have dual quad core processors, for 8 cores total. The Account, Container,
  and Object servers are run with 8 workers each. Most of the background jobs
  are run at a concurrency of 1, with the exception of the replicators which
  are run at a concurrency of 2.

The ``max_clients`` parameter can be used to adjust the number of client
requests an individual worker accepts for processing. The fewer requests being
processed at one time, the less likely a request that consumes the worker's
CPU time, or blocks in the OS, will negatively impact other requests. The more
requests being processed at one time, the more likely one worker can utilize
network and disk capacity.

On systems that have more cores, and more memory, where one can afford to run
more workers, raising the number of workers and lowering the maximum number of
clients serviced per worker can lessen the impact of CPU intensive or stalled
requests.

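A hypothetical tuning along those lines (the values are illustrative only;
the ``max_clients`` default is 1024) might be::

    [DEFAULT]
    workers = 32
    max_clients = 256
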
The ``nice_priority`` parameter can be used to set program scheduling priority.
The ``ionice_class`` and ``ionice_priority`` parameters can be used to set I/O
scheduling class and priority on the systems that use an I/O scheduler that
supports I/O priorities. As of kernel 2.6.17 the only such scheduler is the
Completely Fair Queuing (CFQ) I/O scheduler. If you run your Storage servers
all together on the same servers, you can slow down the auditors or prioritize
object-server I/O via these parameters (but probably do not need to change
them on the proxy). It is a new feature and the best practices are still
being developed. On some systems it may be required to run the daemons as root.
For more info also see setpriority(2) and ioprio_set(2).

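For illustration (the values are examples only; consult the sample configs
for the accepted ``ionice_class`` names), an operator might de-prioritize
auditor I/O on a storage node with::

    [object-auditor]
    nice_priority = 10
    ionice_class = IOPRIO_CLASS_IDLE
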
The above configuration settings should be taken as suggestions, and testing
of configuration settings should be done to ensure best utilization of CPU,
network connectivity, and disk I/O.

-------------------------
Filesystem Considerations
-------------------------

Swift is designed to be mostly filesystem agnostic--the only requirement
being that the filesystem supports extended attributes (xattrs). After
thorough testing with our use cases and hardware configurations, XFS was
the best all-around choice. If you decide to use a filesystem other than
XFS, we highly recommend thorough testing.

For distros with more recent kernels (for example Ubuntu 12.04 Precise),
we recommend using the default settings (including the default inode size
of 256 bytes) when creating the file system::

    mkfs.xfs -L D1 /dev/sda1

In the last couple of years, XFS has made great improvements in how inodes
are allocated and used. Using the default inode size no longer has an
impact on performance.

For distros with older kernels (for example Ubuntu 10.04 Lucid),
some settings can dramatically impact performance. We recommend the
following when creating the file system::

    mkfs.xfs -i size=1024 -L D1 /dev/sda1

Setting the inode size is important, as XFS stores xattr data in the inode.
If the metadata is too large to fit in the inode, a new extent is created,
which can cause quite a performance problem. Upping the inode size to 1024
bytes provides enough room to write the default metadata, plus a little
headroom.

The following example mount options are recommended when using XFS::

    mount -t xfs -o noatime -L D1 /srv/node/d1

We do not recommend running Swift on RAID, but if you are using
RAID it is also important to make sure that the proper sunit and swidth
settings get set so that XFS can make most efficient use of the RAID array.

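If you do run on RAID, a hedged example (the stripe unit and width below are
placeholders and must be derived from your actual array geometry) is::

    mkfs.xfs -d su=64k,sw=6 -L D1 /dev/sda1
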
For a standard Swift install, all data drives are mounted directly under
``/srv/node`` (as can be seen in the above example of mounting label ``D1``
as ``/srv/node/d1``). If you choose to mount the drives in another directory,
be sure to set the ``devices`` config option in all of the server configs to
point to the correct directory.

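For example, if the drives were mounted under a hypothetical ``/mnt/swift``
instead, each server config would need::

    [DEFAULT]
    devices = /mnt/swift
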
The mount points for each drive in ``/srv/node/`` should be owned by the root
user almost exclusively (``root:root 755``). This is required to prevent rsync
from syncing files into the root drive in the event a drive is unmounted.

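A sketch of preparing one mount point (the drive name ``d1`` is just an
example)::

    mkdir -p /srv/node/d1
    chown root:root /srv/node/d1
    chmod 755 /srv/node/d1
    mount -t xfs -o noatime -L D1 /srv/node/d1
    # once mounted, the contents are typically owned by the user the Swift
    # services run as (often "swift")
    chown swift:swift /srv/node/d1
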
Swift uses system calls to reserve space for new objects being written into
the system. If your filesystem does not support ``fallocate()`` or
``posix_fallocate()``, be sure to set the ``disable_fallocate = true`` config
parameter in account, container, and object server configs.

Most current Linux distributions ship with a default installation of updatedb.
This tool runs periodically and updates the file name database that is used by
the GNU locate tool. However, including Swift object and container database
files is most likely not required and the periodic update affects the
performance quite a bit. To disable the inclusion of these files add the path
where Swift stores its data to the setting PRUNEPATHS in ``/etc/updatedb.conf``::

    PRUNEPATHS="... /tmp ... /var/spool ... /srv/node"

---------------------
General System Tuning
---------------------

The following changes have been found to be useful when running Swift on Ubuntu
Server 10.04.

The following settings should be in ``/etc/sysctl.conf``::

    # disable TIME_WAIT.. wait..
    net.ipv4.tcp_tw_recycle=1
    net.ipv4.tcp_tw_reuse=1

    # disable syn cookies
    net.ipv4.tcp_syncookies = 0

    # double amount of allowed conntrack
    net.netfilter.nf_conntrack_max = 262144

To load the updated sysctl settings, run ``sudo sysctl -p``.

A note about changing the TIME_WAIT values. By default the OS will hold
a port open for 60 seconds to ensure that any remaining packets can be
received. During high usage, and with the number of connections that are
created, it is easy to run out of ports. We can change this since we are
in control of the network. If you are not in control of the network, or
do not expect high loads, then you may not want to adjust those values.

----------------------
Logging Considerations
----------------------

Swift is set up to log directly to syslog. Every service can be configured
with the ``log_facility`` option to set the syslog log facility destination.
We recommend using syslog-ng to route the logs to specific log
files locally on the server and also to remote log collecting servers.
Additionally, custom log handlers can be used via the ``custom_log_handlers``
setting.

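For example (the facility choice is deployment specific), each service can be
pointed at its own facility so that syslog-ng can split the streams::

    # proxy-server.conf
    [DEFAULT]
    log_facility = LOG_LOCAL1

    # object-server.conf
    [DEFAULT]
    log_facility = LOG_LOCAL2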