swift/doc/source/ops_runbook/maintenance.rst
gaofei 10542d00ea Replace Chinese punctuation with English punctuation
Curly quotes(Chinese punctuation) usually input from Chinese input method.
When read from english context, it makes some confusion.

Change-Id: Ibd50299ee287c56ec4759ea8ff53d47d006144f8
2018-01-25 05:33:04 +00:00

331 lines
16 KiB
ReStructuredText

==================
Server maintenance
==================
General assumptions
~~~~~~~~~~~~~~~~~~~
- It is assumed that anyone attempting to replace hardware components
will have already read and understood the appropriate maintenance and
service guides.
- It is assumed that where servers need to be taken off-line for
hardware replacement, that this will be done in series, bringing the
server back on-line before taking the next off-line.
- It is assumed that the operations directed procedure will be used for
identifying hardware for replacement.
Assessing the health of swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can run the swift-recon tool on a Swift proxy node to get a quick
check of how Swift is doing. Please note that the numbers below are
necessarily somewhat subjective. Sometimes parameters for which we
say 'low values are good' will have pretty high values for a time. Often
if you wait a while things get better.
For example:
.. code::
sudo swift-recon -rla
===============================================================================
[2012-03-10 12:57:21] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 1, avg: 0, total: 1
===============================================================================
[2012-03-10 12:57:22] Checking replication times on 384 hosts...
[Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
===============================================================================
[2012-03-10 12:57:22] Checking load avg's on 384 hosts...
[5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
[15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
[1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
===============================================================================
In the example above we ask for information on replication times (-r),
load averages (-l) and async pendings (-a). This is a healthy Swift
system. Rules-of-thumb for 'good' recon output are:
- Nodes that respond are up and running Swift. If all nodes respond,
that is a good sign. But some nodes may time out. For example:
.. code::
-> [http://<redacted>.29:6200/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
-> [http://<redacted>.31:6200/recon/load:] <urlopen error timed out>
- That could be okay or could require investigation.
- Low values (say < 10 for high and average) for async pendings are
good. Higher values occur when disks are down and/or when the system
is heavily loaded. Many simultaneous PUTs to the same container can
drive async pendings up. This may be normal, and may resolve itself
after a while. If it persists, one way to track down the problem is
to find a node with high async pendings (with ``swift-recon -av | sort
-n -k4``), then check its Swift logs, Often async pendings are high
because a node cannot write to a container on another node. Often
this is because the node or disk is offline or bad. This may be okay
if we know about it.
- Low values for replication times are good. These values rise when new
rings are pushed, and when nodes and devices are brought back on
line.
- Our 'high' load average values are typically in the 9-15 range. If
they are a lot bigger it is worth having a look at the systems
pushing the average up. Run ``swift-recon -av`` to get the individual
averages. To sort the entries with the highest at the end,
run ``swift-recon -av | sort -n -k4``.
For comparison here is the recon output for the same system above when
two entire racks of Swift are down:
.. code::
[2012-03-10 16:56:33] Checking async pendings on 384 hosts...
-> http://<redacted>.22:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.18:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.16:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.13:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.30:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.6:6200/recon/async: <urlopen error timed out>
.........
-> http://<redacted>.5:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.15:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.9:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.27:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.4:6200/recon/async: <urlopen error timed out>
-> http://<redacted>.8:6200/recon/async: <urlopen error timed out>
Async stats: low: 243, high: 659, avg: 413, total: 132275
===============================================================================
[2012-03-10 16:57:48] Checking replication times on 384 hosts...
-> http://<redacted>.22:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.18:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.16:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.13:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.30:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.6:6200/recon/replication: <urlopen error timed out>
............
-> http://<redacted>.5:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.15:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.9:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.27:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.4:6200/recon/replication: <urlopen error timed out>
-> http://<redacted>.8:6200/recon/replication: <urlopen error timed out>
[Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.285
9475361
===============================================================================
[2012-03-10 16:59:03] Checking load avg's on 384 hosts...
-> http://<redacted>.22:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.18:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.16:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.13:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.30:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.6:6200/recon/load: <urlopen error timed out>
............
-> http://<redacted>.15:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.9:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.27:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.4:6200/recon/load: <urlopen error timed out>
-> http://<redacted>.8:6200/recon/load: <urlopen error timed out>
[5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
[15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
[1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
===============================================================================
.. note::
The replication times and load averages are within reasonable
parameters, even with 80 object stores down. Async pendings, however is
quite high. This is due to the fact that the containers on the servers
which are down cannot be updated. When those servers come back up, async
pendings should drop. If async pendings were at this level without an
explanation, we have a problem.
Recon examples
~~~~~~~~~~~~~~
Here is an example of noting and tracking down a problem with recon.
Running reccon shows some async pendings:
.. code::
bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
===============================================================================
[2012-03-14 17:25:55] Checking async pendings on 384 hosts...
Async stats: low: 0, high: 23, avg: 8, total: 3356
===============================================================================
[2012-03-14 17:25:55] Checking replication times on 384 hosts...
[Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
===============================================================================
[2012-03-14 17:25:56] Checking load avg's on 384 hosts...
[5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
[15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
[1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
===============================================================================
Why? Running recon again with -av swift (not shown here) tells us that
the node with the highest (23) is <redacted>.72.61. Looking at the log
files on <redacted>.72.61 we see:
.. code::
souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | - grep -i ERROR
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6201}
Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
{'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6201}
That is why this node has a lot of async pendings: a bunch of disks that
are not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit, as
other nodes will be having the same problem.
Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::
This procedure will tell you if you have a problem, however, in practice
you will find that you will not use this procedure frequently.
If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.
Procedure
---------
.. note::
swift has three rings: one each for objects, containers and accounts.
This procedure should be run three times, each time specifying the
appropriate ``*.builder`` file.
#. Determine whether all three nodes are in different Swift zones by
running the ring builder on a proxy node to determine which zones
the storage nodes are in. For example:
.. code::
% sudo swift-ring-builder /etc/swift/object.builder
/etc/swift/object.builder, build version 1467
2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
The minimum number of hours before a partition can be reassigned is 24
Devices: id zone ip address port name weight partitions balance meta
0 1 <redacted>.4 6200 disk0 1708.00 4259 -0.00
1 1 <redacted>.4 6200 disk1 1708.00 4260 0.02
2 1 <redacted>.4 6200 disk2 1952.00 4868 0.01
3 1 <redacted>.4 6200 disk3 1952.00 4868 0.01
4 1 <redacted>.4 6200 disk4 1952.00 4867 -0.01
#. Here, node <redacted>.4 is in zone 1. If two or more of the three
nodes under consideration are in the same Swift zone, they do not
have any ring partitions in common; there is little/no data
availability risk if all three nodes are down.
#. If the nodes are in three distinct Swift zones it is necessary to
whether the nodes have ring partitions in common. Run ``swift-ring``
builder again, this time with the ``list_parts`` option and specify
the nodes under consideration. For example:
.. code::
% sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
Partition Matches
91 2
729 2
3754 2
3769 2
3947 2
5818 2
7918 2
8733 2
9509 2
10233 2
#. The ``list_parts`` option to the ring builder indicates how many ring
partitions the nodes have in common. If, as in this case, the
first entry in the list has a 'Matches' column of 2 or less, there
is no data availability risk if all three nodes are down.
#. If the 'Matches' column has entries equal to 3, there is some data
availability risk if all three nodes are down. The risk is generally
small, and is proportional to the number of entries that have a 3 in
the Matches column. For example:
.. code::
Partition Matches
26865 3
362367 3
745940 3
778715 3
797559 3
820295 3
822118 3
839603 3
852332 3
855965 3
858016 3
#. A quick way to count the number of rows with 3 matches is:
.. code::
% sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l
30
#. In this case the nodes have 30 out of a total of 2097152 partitions
in common; about 0.001%. In this case the risk is small/nonzero.
Recall that a partition is simply a portion of the ring mapping
space, not actual data. So having partitions in common is a necessary
but not sufficient condition for data unavailability.
.. note::
We should not bring down a node for repair if it shows
Matches entries of 3 with other nodes that are also down.
If three nodes that have 3 partitions in common are all down, there is
a nonzero probability that data are unavailable and we should work to
bring some or all of the nodes up ASAP.
Swift startup/shutdown
~~~~~~~~~~~~~~~~~~~~~~
- Use reload - not stop/start/restart.
- Try to roll sets of servers (especially proxy) in groups of less
than 20% of your servers.