Cleanup of Swift Ops Runbook
This patch cleans up some rough edges that were left (due to time constraints) in the original commit.

Change-Id: Id4480be8dc1b5c920c19988cb89ca8b60ace91b4
Co-Authored-By: Gerry Drudy <gerry.drudy@hpe.com>
This commit is contained in:
parent
643dbce134
commit
e38b53393f
@ -234,9 +234,11 @@ using the format `regex_pattern_X = regex_expression`, where `X` is a number.
|
|||||||
This script has been tested on Ubuntu 10.04 and Ubuntu 12.04, so if you are
|
This script has been tested on Ubuntu 10.04 and Ubuntu 12.04, so if you are
|
||||||
using a different distro or OS, some care should be taken before using it in production.
|
using a different distro or OS, some care should be taken before using it in production.
|
||||||
|
|
||||||
--------------
|
.. _dispersion_report:
|
||||||
Cluster Health
|
|
||||||
--------------
|
-----------------
|
||||||
|
Dispersion Report
|
||||||
|
-----------------
|
||||||
|
|
||||||
There is a swift-dispersion-report tool for measuring overall cluster health.
|
There is a swift-dispersion-report tool for measuring overall cluster health.
|
||||||
This is accomplished by checking if a set of deliberately distributed
|
This is accomplished by checking if a set of deliberately distributed
|
||||||
|
@ -2,15 +2,53 @@
|
|||||||
Identifying issues and resolutions
|
Identifying issues and resolutions
|
||||||
==================================
|
==================================
|
||||||
|
|
||||||
|
Is the system up?
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
If you have a report that Swift is down, perform the following basic checks:
|
||||||
|
|
||||||
|
#. Run swift functional tests.
|
||||||
|
|
||||||
|
#. From a server in your data center, use ``curl`` to check ``/healthcheck``
|
||||||
|
(see below).
|
||||||
|
|
||||||
|
#. If you have a monitoring system, check your monitoring system.
|
||||||
|
|
||||||
|
#. Check your hardware load balancers infrastructure.
|
||||||
|
|
||||||
|
#. Run swift-recon on a proxy node.
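For the last step, a minimal sketch of a quick ``swift-recon`` check run from a proxy node (the flags shown are standard ``swift-recon`` options; pick the checks that matter to you):

.. code::

   # Disk usage, load averages and ring/config md5 consistency across the cluster
   $ swift-recon -d -l --md5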
|
||||||
|
|
||||||
|
Functional tests usage
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
We would recommend that you set up the functional tests to run against your
|
||||||
|
production system. Run regularly, it can be a useful tool to validate
|
||||||
|
that the system is configured correctly. In addition, it can provide
|
||||||
|
early warning about failures in your system (if the functional tests stop
|
||||||
|
working, user applications will also probably stop working).
|
||||||
|
|
||||||
|
A script for running the functional tests is located in ``swift/.functests``.
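For example, a minimal sketch of invoking that script from a swift source checkout (assumes the functional test configuration already points at your production endpoint and test accounts):

.. code::

   $ cd swift
   $ ./.functests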
|
||||||
|
|
||||||
|
|
||||||
|
External monitoring
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
We use pingdom.com to monitor the external Swift API. We suggest the
|
||||||
|
following:
|
||||||
|
|
||||||
|
- Do a GET on ``/healthcheck``
|
||||||
|
|
||||||
|
- Create a container, make it public (x-container-read:
|
||||||
|
.r*,.rlistings), create a small file in the container; do a GET
|
||||||
|
on the object
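A minimal sketch of the second check using ``curl`` directly (the endpoint, token, project ID and container name are placeholders, not values from this runbook):

.. code::

   # Create a public container, upload a small object, then fetch it anonymously
   $ curl -X PUT -H "X-Auth-Token: $TOKEN" \
         -H "X-Container-Read: .r*,.rlistings" \
         https://<endpoint>/v1/AUTH_<project-id>/monitor_container
   $ echo "canary" | curl -X PUT -H "X-Auth-Token: $TOKEN" --data-binary @- \
         https://<endpoint>/v1/AUTH_<project-id>/monitor_container/canary.txt
   $ curl https://<endpoint>/v1/AUTH_<project-id>/monitor_container/canary.txt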
|
||||||
|
|
||||||
Diagnose: General approach
|
Diagnose: General approach
|
||||||
--------------------------
|
--------------------------
|
||||||
|
|
||||||
- Look at service status in your monitoring system.
|
- Look at service status in your monitoring system.
|
||||||
|
|
||||||
- In addition to system monitoring tools and issue logging by users,
|
- In addition to system monitoring tools and issue logging by users,
|
||||||
swift errors will often result in log entries in the ``/var/log/swift``
|
swift errors will often result in log entries (see :ref:`swift_logs`).
|
||||||
files: ``proxy.log``, ``server.log`` and ``background.log`` (see:``Swift
|
|
||||||
logs``).
|
|
||||||
|
|
||||||
- Look at any logs your deployment tool produces.
|
- Look at any logs your deployment tool produces.
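For example, a quick scan for recent errors in the Swift logs described in :ref:`swift_logs` (log file names follow the layout used elsewhere in this runbook; adjust to your deployment):

.. code::

   $ sudo grep -i error /var/log/swift/proxy.log | tail -20
   $ sudo grep -i error /var/log/swift/server.log /var/log/swift/background.log | tail -20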
|
||||||
|
|
||||||
@ -33,22 +71,24 @@ Diagnose: Swift-dispersion-report
|
|||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
The swift-dispersion-report is a useful tool to gauge the general
|
The swift-dispersion-report is a useful tool to gauge the general
|
||||||
health of the system. Configure the ``swift-dispersion`` report for
|
health of the system. Configure the ``swift-dispersion`` report to cover at
|
||||||
100% coverage. The dispersion report regularly monitors
|
a minimum every disk drive in your system (usually 1% coverage).
|
||||||
these and gives a report of the amount of objects/containers are still
|
See :ref:`dispersion_report` for details of how to configure and
|
||||||
available as well as how many copies of them are also there.
|
use the dispersion reporting tool.
|
||||||
|
|
||||||
The dispersion-report output is logged on the first proxy of the first
|
The ``swift-dispersion-report`` tool can take a long time to run, especially
|
||||||
AZ or each system (proxy with the monitoring role) under
|
if any servers are down. We suggest you run it regularly
|
||||||
``/var/log/swift/swift-dispersion-report.log``.
|
(e.g., in a cron job) and save the results. This makes it easy to refer
|
||||||
|
to the last report without having to wait for a long-running command
|
||||||
|
to complete.
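A minimal sketch of such a cron job (the schedule, user and log path are assumptions; adapt them to your environment):

.. code::

   # /etc/cron.d/swift-dispersion-report -- run hourly as the swift user
   0 * * * * swift /usr/bin/swift-dispersion-report >> /var/log/swift/swift-dispersion-report.log 2>&1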
|
||||||
|
|
||||||
Diagnose: Is swift running?
|
Diagnose: Is system responding to /healthcheck?
|
||||||
---------------------------
|
-----------------------------------------------
|
||||||
|
|
||||||
When you want to establish if a swift endpoint is running, run ``curl -k``
|
When you want to establish if a swift endpoint is running, run ``curl -k``
|
||||||
against either: https://*[REPLACEABLE]*./healthcheck OR
|
against https://*[ENDPOINT]*/healthcheck.
|
||||||
https:*[REPLACEABLE]*.crossdomain.xml
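For example (the endpoint is a placeholder; the healthcheck middleware normally answers a GET with ``OK``):

.. code::

   $ curl -k https://<endpoint>/healthcheck
   OK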
|
|
||||||
|
|
||||||
|
.. _swift_logs:
|
||||||
|
|
||||||
Diagnose: Interpreting messages in ``/var/log/swift/`` files
|
Diagnose: Interpreting messages in ``/var/log/swift/`` files
|
||||||
------------------------------------------------------------
|
------------------------------------------------------------
|
||||||
@ -70,25 +110,20 @@ The following table lists known issues:
|
|||||||
- **Signature**
|
- **Signature**
|
||||||
- **Issue**
|
- **Issue**
|
||||||
- **Steps to take**
|
- **Steps to take**
|
||||||
* - /var/log/syslog
|
|
||||||
- kernel: [] hpsa .... .... .... has check condition: unknown type:
|
|
||||||
Sense: 0x5, ASC: 0x20, ASC Q: 0x0 ....
|
|
||||||
- An unsupported command was issued to the storage hardware
|
|
||||||
- Understood to be a benign monitoring issue, ignore
|
|
||||||
* - /var/log/syslog
|
* - /var/log/syslog
|
||||||
- kernel: [] sd .... [csbu:sd...] Sense Key: Medium Error
|
- kernel: [] sd .... [csbu:sd...] Sense Key: Medium Error
|
||||||
- Suggests disk surface issues
|
- Suggests disk surface issues
|
||||||
- Run swift diagnostics on the target node to check for disk errors,
|
- Run ``swift-drive-audit`` on the target node to check for disk errors,
|
||||||
repair disk errors
|
repair disk errors
|
||||||
* - /var/log/syslog
|
* - /var/log/syslog
|
||||||
- kernel: [] sd .... [csbu:sd...] Sense Key: Hardware Error
|
- kernel: [] sd .... [csbu:sd...] Sense Key: Hardware Error
|
||||||
- Suggests storage hardware issues
|
- Suggests storage hardware issues
|
||||||
- Run swift diagnostics on the target node to check for disk failures,
|
- Run diagnostics on the target node to check for disk failures,
|
||||||
replace failed disks
|
replace failed disks
|
||||||
* - /var/log/syslog
|
* - /var/log/syslog
|
||||||
- kernel: [] .... I/O error, dev sd.... ,sector ....
|
- kernel: [] .... I/O error, dev sd.... ,sector ....
|
||||||
-
|
-
|
||||||
- Run swift diagnostics on the target node to check for disk errors
|
- Run diagnostics on the target node to check for disk errors
|
||||||
* - /var/log/syslog
|
* - /var/log/syslog
|
||||||
- pound: NULL get_thr_arg
|
- pound: NULL get_thr_arg
|
||||||
- Multiple threads woke up
|
- Multiple threads woke up
|
||||||
@ -96,59 +131,61 @@ The following table lists known issues:
|
|||||||
* - /var/log/swift/proxy.log
|
* - /var/log/swift/proxy.log
|
||||||
- .... ERROR .... ConnectionTimeout ....
|
- .... ERROR .... ConnectionTimeout ....
|
||||||
- A storage node is not responding in a timely fashion
|
- A storage node is not responding in a timely fashion
|
||||||
- Run swift diagnostics on the target node to check for node down,
|
- Check if node is down, not running Swift,
|
||||||
node unconfigured, storage off-line or network issues between the
|
unconfigured, storage off-line or for network issues between the
|
||||||
proxy and non-responding node
|
proxy and non-responding node
|
||||||
* - /var/log/swift/proxy.log
|
* - /var/log/swift/proxy.log
|
||||||
- proxy-server .... HTTP/1.0 500 ....
|
- proxy-server .... HTTP/1.0 500 ....
|
||||||
- A proxy server has reported an internal server error
|
- A proxy server has reported an internal server error
|
||||||
- Run swift diagnostics on the target node to check for issues
|
- Examine the logs for any errors at the time the error was reported to
|
||||||
|
attempt to understand the cause of the error.
|
||||||
* - /var/log/swift/server.log
|
* - /var/log/swift/server.log
|
||||||
- .... ERROR .... ConnectionTimeout ....
|
- .... ERROR .... ConnectionTimeout ....
|
||||||
- A storage server is not responding in a timely fashion
|
- A storage server is not responding in a timely fashion
|
||||||
- Run swift diagnostics on the target node to check for a node or
|
- Check if node is down, not running Swift,
|
||||||
service, down, unconfigured, storage off-line or network issues
|
unconfigured, storage off-line or for network issues between the
|
||||||
between the two nodes
|
server and non-responding node
|
||||||
* - /var/log/swift/server.log
|
* - /var/log/swift/server.log
|
||||||
- .... ERROR .... Remote I/O error: '/srv/node/disk....
|
- .... ERROR .... Remote I/O error: '/srv/node/disk....
|
||||||
- A storage device is not responding as expected
|
- A storage device is not responding as expected
|
||||||
- Run swift diagnostics and check the filesystem named in the error
|
- Run ``swift-drive-audit`` and check the filesystem named in the error
|
||||||
for corruption (unmount & xfs_repair)
|
for corruption (unmount & xfs_repair). Check if the filesystem
|
||||||
|
is mounted and working.
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- object-server ERROR container update failed .... Connection refused
|
- object-server ERROR container update failed .... Connection refused
|
||||||
- Peer node is not responding
|
- A container server node could not be contacted
|
||||||
- Check status of the network and peer node
|
- Check if node is down, not running Swift,
|
||||||
|
unconfigured, storage off-line or for network issues between the
|
||||||
|
server and non-responding node
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- object-updater ERROR with remote .... ConnectionTimeout
|
- object-updater ERROR with remote .... ConnectionTimeout
|
||||||
-
|
- The remote container server is busy
|
||||||
- Check status of the network and peer node
|
- If the container is very large, some errors updating it can be
|
||||||
|
expected. However, this error can also occur if there is a networking
|
||||||
|
issue.
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- account-reaper STDOUT: .... error: ECONNREFUSED
|
- account-reaper STDOUT: .... error: ECONNREFUSED
|
||||||
- Network connectivity issue
|
- Network connectivity issue or the target server is down.
|
||||||
- Resolve network issue and re-run diagnostics
|
- Resolve network issue or reboot the target server
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- .... ERROR .... ConnectionTimeout
|
- .... ERROR .... ConnectionTimeout
|
||||||
- A storage server is not responding in a timely fashion
|
- A storage server is not responding in a timely fashion
|
||||||
- Run swift diagnostics on the target node to check for a node
|
- The target server may be busy. However, this error can also occur if
|
||||||
or service, down, unconfigured, storage off-line or network issues
|
there is a networking issue.
|
||||||
between the two nodes
|
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- .... ERROR syncing .... Timeout
|
- .... ERROR syncing .... Timeout
|
||||||
- A storage server is not responding in a timely fashion
|
- A timeout occurred syncing data to another node.
|
||||||
- Run swift diagnostics on the target node to check for a node
|
- The target server may be busy. However, this error can also occur if
|
||||||
or service, down, unconfigured, storage off-line or network issues
|
there is a networking issue.
|
||||||
between the two nodes
|
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- .... ERROR Remote drive not mounted ....
|
- .... ERROR Remote drive not mounted ....
|
||||||
- A storage server disk is unavailable
|
- A storage server disk is unavailable
|
||||||
- Run swift diagnostics on the target node to check for a node or
|
- Repair and remount the file system (on the remote node)
|
||||||
service, failed or unmounted disk on the target, or a network issue
|
|
||||||
* - /var/log/swift/background.log
|
* - /var/log/swift/background.log
|
||||||
- object-replicator .... responded as unmounted
|
- object-replicator .... responded as unmounted
|
||||||
- A storage server disk is unavailable
|
- A storage server disk is unavailable
|
||||||
- Run swift diagnostics on the target node to check for a node or
|
- Repair and remount the file system (on the remote node)
|
||||||
service, failed or unmounted disk on the target, or a network issue
|
* - /var/log/swift/*.log
|
||||||
* - /var/log/swift/\*.log
|
|
||||||
- STDOUT: EXCEPTION IN
|
- STDOUT: EXCEPTION IN
|
||||||
- An unexpected error occurred
|
- An unexpected error occurred
|
||||||
- Read the Traceback details, if it matches known issues
|
- Read the Traceback details, if it matches known issues
|
||||||
@ -157,19 +194,14 @@ The following table lists known issues:
|
|||||||
* - /var/log/rsyncd.log
|
* - /var/log/rsyncd.log
|
||||||
- rsync: mkdir "/disk....failed: No such file or directory....
|
- rsync: mkdir "/disk....failed: No such file or directory....
|
||||||
- A local storage server disk is unavailable
|
- A local storage server disk is unavailable
|
||||||
- Run swift diagnostics on the node to check for a failed or
|
- Run diagnostics on the node to check for a failed or
|
||||||
unmounted disk
|
unmounted disk
|
||||||
* - /var/log/swift*
|
* - /var/log/swift*
|
||||||
- Exception: Could not bind to 0.0.0.0:600xxx
|
- Exception: Could not bind to 0.0.0.0:6xxx
|
||||||
- Possible Swift process restart issue. This indicates an old swift
|
- Possible Swift process restart issue. This indicates an old swift
|
||||||
process is still running.
|
process is still running.
|
||||||
- Run swift diagnostics, if some swift services are reported down,
|
- Restart Swift services. If some swift services are reported down,
|
||||||
check if they left a residual process behind.
|
check if they left a residual process behind.
|
||||||
* - /var/log/rsyncd.log
|
|
||||||
- rsync: recv_generator: failed to stat "/disk....." (in object)
|
|
||||||
failed: Not a directory (20)
|
|
||||||
- Swift directory structure issues
|
|
||||||
- Run swift diagnostics on the node to check for issues
|
|
||||||
|
|
||||||
Diagnose: Parted reports the backup GPT table is corrupt
|
Diagnose: Parted reports the backup GPT table is corrupt
|
||||||
--------------------------------------------------------
|
--------------------------------------------------------
|
||||||
@ -188,7 +220,7 @@ Diagnose: Parted reports the backup GPT table is corrupt
|
|||||||
|
|
||||||
OK/Cancel?
|
OK/Cancel?
|
||||||
|
|
||||||
To fix, go to: Fix broken GPT table (broken disk partition)
|
To fix, go to :ref:`fix_broken_gpt_table`.
|
||||||
|
|
||||||
|
|
||||||
Diagnose: Drives diagnostic reports a FS label is not acceptable
|
Diagnose: Drives diagnostic reports a FS label is not acceptable
|
||||||
@ -240,9 +272,10 @@ Diagnose: Failed LUNs
|
|||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
The HPE Helion Public Cloud uses direct attach SmartArry
|
The HPE Helion Public Cloud uses direct attach SmartArray
|
||||||
controllers/drives. The information here is specific to that
|
controllers/drives. The information here is specific to that
|
||||||
environment.
|
environment. The hpacucli utility mentioned here may be called
|
||||||
|
hpssacli in your environment.
|
||||||
|
|
||||||
The ``swift_diagnostics`` mount checks may return a warning that a LUN has
|
The ``swift_diagnostics`` mount checks may return a warning that a LUN has
|
||||||
failed, typically accompanied by DriveAudit check failures and device
|
failed, typically accompanied by DriveAudit check failures and device
|
||||||
@ -254,7 +287,7 @@ the procedure to replace the disk.
|
|||||||
|
|
||||||
Otherwise the LUN can be re-enabled as follows:
|
Otherwise the LUN can be re-enabled as follows:
|
||||||
|
|
||||||
#. Generate a hpssacli diagnostic report. This report allows the swift
|
#. Generate a hpssacli diagnostic report. This report allows the DC
|
||||||
team to troubleshoot potential cabling or hardware issues so it is
|
team to troubleshoot potential cabling or hardware issues so it is
|
||||||
imperative that you run it immediately when troubleshooting a failed
|
imperative that you run it immediately when troubleshooting a failed
|
||||||
LUN. You will come back later and grep this file for more details, but
|
LUN. You will come back later and grep this file for more details, but
|
||||||
@ -262,8 +295,7 @@ Otherwise the lun can be re-enabled as follows:
|
|||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on \
|
sudo hpssacli controller all diag file=/tmp/hpacu.diag ris=on xml=off zip=off
|
||||||
xml=off zip=off
|
|
||||||
|
|
||||||
Export the following variables using the below instructions before
|
Export the following variables using the below instructions before
|
||||||
proceeding further.
|
proceeding further.
|
||||||
@ -317,8 +349,7 @@ proceeding further.
|
|||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
sudo hpssacli controller slot=1 ld ${LDRIVE} show detail \
|
sudo hpssacli controller slot=1 ld ${LDRIVE} show detail | grep -i "Disk Name"
|
||||||
grep -i "Disk Name"
|
|
||||||
|
|
||||||
#. Export the device name variable from the preceding command (example:
|
#. Export the device name variable from the preceding command (example:
|
||||||
/dev/sdk):
|
/dev/sdk):
|
||||||
@ -396,6 +427,8 @@ proceeding further.
|
|||||||
should be checked. For example, log a DC ticket to check the sas cables
|
should be checked. For example, log a DC ticket to check the sas cables
|
||||||
between the drive and the expander.
|
between the drive and the expander.
|
||||||
|
|
||||||
|
.. _diagnose_slow_disk_drives:
|
||||||
|
|
||||||
Diagnose: Slow disk devices
|
Diagnose: Slow disk devices
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
||||||
@ -404,7 +437,8 @@ Diagnose: Slow disk devices
|
|||||||
collectl is an open-source performance gathering/analysis tool.
|
collectl is an open-source performance gathering/analysis tool.
|
||||||
|
|
||||||
If the diagnostics report a message such as ``sda: drive is slow``, you
|
If the diagnostics report a message such as ``sda: drive is slow``, you
|
||||||
should log onto the node and run the following comand:
|
should log onto the node and run the following command (remove the ``-c 1`` option to continuously monitor
|
||||||
|
the data):
|
||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
@ -431,13 +465,12 @@ should log onto the node and run the following comand:
|
|||||||
dm-3 0 0 0 0 0 0 0 0 0 0 0 0 0
|
dm-3 0 0 0 0 0 0 0 0 0 0 0 0 0
|
||||||
dm-4 0 0 0 0 0 0 0 0 0 0 0 0 0
|
dm-4 0 0 0 0 0 0 0 0 0 0 0 0 0
|
||||||
dm-5 0 0 0 0 0 0 0 0 0 0 0 0 0
|
dm-5 0 0 0 0 0 0 0 0 0 0 0 0 0
|
||||||
...
|
|
||||||
(repeats -- type Ctrl/C to stop)
|
|
||||||
|
|
||||||
Look at the ``Wait`` and ``SvcTime`` values. It is not normal for
|
Look at the ``Wait`` and ``SvcTime`` values. It is not normal for
|
||||||
these values to exceed 50msec. This is known to impact customer
|
these values to exceed 50msec. This is known to impact customer
|
||||||
performance (upload/download. For a controller problem, many/all drives
|
performance (upload/download). For a controller problem, many/all drives
|
||||||
will show how wait and service times. A reboot may correct the prblem;
|
will show long wait and service times. A reboot may correct the problem;
|
||||||
otherwise hardware replacement is needed.
|
otherwise hardware replacement is needed.
|
||||||
|
|
||||||
Another way to look at the data is as follows:
|
Another way to look at the data is as follows:
|
||||||
@ -526,12 +559,12 @@ be disabled on a per-drive basis.
|
|||||||
Diagnose: Slow network link - Measuring network performance
|
Diagnose: Slow network link - Measuring network performance
|
||||||
-----------------------------------------------------------
|
-----------------------------------------------------------
|
||||||
|
|
||||||
Network faults can cause performance between Swift nodes to degrade. The
|
Network faults can cause performance between Swift nodes to degrade. Testing
|
||||||
following tests are recommended. Other methods (such as copying large
|
with ``netperf`` is recommended. Other methods (such as copying large
|
||||||
files) may also work, but can produce inconclusive results.
|
files) may also work, but can produce inconclusive results.
|
||||||
|
|
||||||
Use netperf on all production systems. Install on all systems if not
|
Install ``netperf`` on all systems if not
|
||||||
already installed. And the UFW rules for its control port are in place.
|
already installed. Check that the UFW rules for its control port are in place.
|
||||||
However, there are no pre-opened ports for netperf's data connection. Pick a
|
However, there are no pre-opened ports for netperf's data connection. Pick a
|
||||||
port number. In this example, 12866 is used because it is one higher
|
port number. In this example, 12866 is used because it is one higher
|
||||||
than netperf's default control port number, 12865. If you get very
|
than netperf's default control port number, 12865. If you get very
|
||||||
@ -561,11 +594,11 @@ Running tests
|
|||||||
|
|
||||||
#. On the ``source`` node, run the following command to check
|
#. On the ``source`` node, run the following command to check
|
||||||
throughput. Note the double-dash before the -P option.
|
throughput. Note the double-dash before the -P option.
|
||||||
The command takes 10 seconds to complete.
|
The command takes 10 seconds to complete. The ``target`` node is 192.168.245.5.
|
||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
$ netperf -H <redacted>.72.4
|
$ netperf -H 192.168.245.5 -- -P 12866
|
||||||
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12866 AF_INET to
|
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12866 AF_INET to
|
||||||
<redacted>.72.4 (<redacted>.72.4) port 12866 AF_INET : demo
|
<redacted>.72.4 (<redacted>.72.4) port 12866 AF_INET : demo
|
||||||
Recv Send Send
|
Recv Send Send
|
||||||
@ -578,7 +611,7 @@ Running tests
|
|||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
$ netperf -H <redacted>.72.4 -t TCP_RR -- -P 12866
|
$ netperf -H 192.168.245.5 -t TCP_RR -- -P 12866
|
||||||
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12866
|
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12866
|
||||||
AF_INET to <redacted>.72.4 (<redacted>.72.4) port 12866 AF_INET : demo
|
AF_INET to <redacted>.72.4 (<redacted>.72.4) port 12866 AF_INET : demo
|
||||||
: first burst 0
|
: first burst 0
|
||||||
@ -763,7 +796,7 @@ Diagnose: High system latency
|
|||||||
used by the monitor program happen to live on the bad object server.
|
used by the monitor program happen to live on the bad object server.
|
||||||
|
|
||||||
- A general network problem within the data center. Compare the results
|
- A general network problem within the data center. Compare the results
|
||||||
with the Pingdom monitors too see if they also have a problem.
|
with the Pingdom monitors to see if they also have a problem.
|
||||||
|
|
||||||
Diagnose: Interface reports errors
|
Diagnose: Interface reports errors
|
||||||
----------------------------------
|
----------------------------------
|
||||||
@ -802,59 +835,21 @@ If the nic supports self test, this can be performed with:
|
|||||||
Self tests should read ``PASS`` if the nic is operating correctly.
|
Self tests should read ``PASS`` if the nic is operating correctly.
|
||||||
|
|
||||||
Nic module drivers can be re-initialised by carefully removing and
|
Nic module drivers can be re-initialised by carefully removing and
|
||||||
re-installing the modules. Case in point being the mellanox drivers on
|
re-installing the modules (this avoids rebooting the server).
|
||||||
Swift Proxy servers. which use a two part driver mlx4_en and
|
For example, mellanox drivers use a two part driver mlx4_en and
|
||||||
mlx4_core. To reload these you must carefully remove the mlx4_en
|
mlx4_core. To reload these you must carefully remove the mlx4_en
|
||||||
(ethernet) then the mlx4_core modules, and reinstall them in the
|
(ethernet) then the mlx4_core modules, and reinstall them in the
|
||||||
reverse order.
|
reverse order.
|
||||||
|
|
||||||
As the interface will be disabled while the modules are unloaded, you
|
As the interface will be disabled while the modules are unloaded, you
|
||||||
must be very careful not to lock the interface out. The following
|
must be very careful not to lock yourself out, so it may be better
|
||||||
script can be used to reload the melanox drivers, as a side effect, this
|
to script this.
|
||||||
resets error counts on the interface.
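A minimal sketch of such a reload script, assuming the mlx4 driver pair described above (run it from a console or out-of-band session, since the interface drops while the modules are unloaded):

.. code::

   #!/bin/bash
   # reload_mlx4.sh -- remove the ethernet module then the core module,
   # then reinstall them in the reverse order.
   set -e
   modprobe -r mlx4_en
   modprobe -r mlx4_core
   modprobe mlx4_core
   modprobe mlx4_en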
|
|
||||||
|
|
||||||
|
|
||||||
Diagnose: CorruptDir diagnostic reports corrupt directories
|
|
||||||
-----------------------------------------------------------
|
|
||||||
|
|
||||||
From time to time Swift data structures may become corrupted by
|
|
||||||
misplaced files in filesystem locations that swift would normally place
|
|
||||||
a directory. This causes issues for swift when directory creation is
|
|
||||||
attempted at said location, it may fail due to the pre-existent file. If
|
|
||||||
the CorruptDir diagnostic reports Corrupt directories, they should be
|
|
||||||
checked to see if they exist.
|
|
||||||
|
|
||||||
Checking existence of entries
|
|
||||||
-----------------------------
|
|
||||||
|
|
||||||
Swift data filesystems are located under the ``/srv/node/disk``
|
|
||||||
mountpoints and contain accounts, containers and objects
|
|
||||||
subdirectories which in turn contain partition number subdirectories.
|
|
||||||
The partition number directories contain md5 hash subdirectories. md5
|
|
||||||
hash directories contain md5sum subdirectories. md5sum directories
|
|
||||||
contain the Swift data payload as either a database (.db), for
|
|
||||||
accounts and containers, or a data file (.data) for objects.
|
|
||||||
If the entries reported in diagnostics correspond to a partition
|
|
||||||
number, md5 hash or md5sum directory, check the entry with ``ls
|
|
||||||
-ld *entry*``.
|
|
||||||
If it turns out to be a file rather than a directory, it should be
|
|
||||||
carefully removed.
|
|
||||||
|
|
||||||
.. note::
|
|
||||||
|
|
||||||
Please do not ``ls`` the partition level directory contents, as
|
|
||||||
this *especially objects* may take a lot of time and system resources,
|
|
||||||
if you need to check the contents, use:
|
|
||||||
|
|
||||||
.. code::
|
|
||||||
|
|
||||||
echo /srv/node/disk#/type/partition#/
|
|
||||||
|
|
||||||
Diagnose: Hung swift object replicator
|
Diagnose: Hung swift object replicator
|
||||||
--------------------------------------
|
--------------------------------------
|
||||||
|
|
||||||
The swift diagnostic message ``Object replicator: remaining exceeds
|
A replicator reports in its log that the remaining time exceeds
|
||||||
100hrs:`` may indicate that the swift ``object-replicator`` is stuck and not
|
100 hours. This may indicate that the swift ``object-replicator`` is stuck and not
|
||||||
making progress. Another useful way to check this is with the
|
making progress. Another useful way to check this is with the
|
||||||
'swift-recon -r' command on a swift proxy server:
|
'swift-recon -r' command on a swift proxy server:
|
||||||
|
|
||||||
@ -866,14 +861,13 @@ making progress. Another useful way to check this is with the
|
|||||||
--> Starting reconnaissance on 384 hosts
|
--> Starting reconnaissance on 384 hosts
|
||||||
===============================================================================
|
===============================================================================
|
||||||
[2013-07-17 12:56:19] Checking on replication
|
[2013-07-17 12:56:19] Checking on replication
|
||||||
http://<redacted>.72.63:6000/recon/replication: <urlopen error timed out>
|
|
||||||
[replication_time] low: 2, high: 80, avg: 28.8, total: 11037, Failed: 0.0%, no_result: 0, reported: 383
|
[replication_time] low: 2, high: 80, avg: 28.8, total: 11037, Failed: 0.0%, no_result: 0, reported: 383
|
||||||
Oldest completion was 2013-06-12 22:46:50 (12 days ago) by <redacted>.31:6000.
|
Oldest completion was 2013-06-12 22:46:50 (12 days ago) by 192.168.245.3:6000.
|
||||||
Most recent completion was 2013-07-17 12:56:19 (5 seconds ago) by <redacted>.204.113:6000.
|
Most recent completion was 2013-07-17 12:56:19 (5 seconds ago) by 192.168.245.5:6000.
|
||||||
===============================================================================
|
===============================================================================
|
||||||
|
|
||||||
The ``Oldest completion`` line in this example indicates that the
|
The ``Oldest completion`` line in this example indicates that the
|
||||||
object-replicator on swift object server <redacted>.31 has not completed
|
object-replicator on swift object server 192.168.245.3 has not completed
|
||||||
the replication cycle in 12 days. This replicator is stuck. The object
|
the replication cycle in 12 days. This replicator is stuck. The object
|
||||||
replicator cycle is generally less than 1 hour. Though a replicator
|
replicator cycle is generally less than 1 hour. Though a replicator
|
||||||
cycle of 15-20 hours can occur if nodes are added to the system and a
|
cycle of 15-20 hours can occur if nodes are added to the system and a
|
||||||
@ -886,22 +880,22 @@ the following command:
|
|||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
# sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep"
|
# sudo grep object-rep /var/log/swift/background.log | grep -e "Starting object replication" -e "Object replication complete" -e "partitions rep"
|
||||||
Jul 16 06:25:46 <redacted> object-replicator 15344/16450 (93.28%) partitions replicated in 69018.48s (0.22/sec, 22h remaining)
|
Jul 16 06:25:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69018.48s (0.22/sec, 22h remaining)
|
||||||
Jul 16 06:30:46 <redacted> object-replicator 15344/16450 (93.28%) partitions replicated in 69318.58s (0.22/sec, 22h remaining)
|
Jul 16 06:30:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69318.58s (0.22/sec, 22h remaining)
|
||||||
Jul 16 06:35:46 <redacted> object-replicator 15344/16450 (93.28%) partitions replicated in 69618.63s (0.22/sec, 23h remaining)
|
Jul 16 06:35:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69618.63s (0.22/sec, 23h remaining)
|
||||||
Jul 16 06:40:46 <redacted> object-replicator 15344/16450 (93.28%) partitions replicated in 69918.73s (0.22/sec, 23h remaining)
|
Jul 16 06:40:46 192.168.245.4 object-replicator 15344/16450 (93.28%) partitions replicated in 69918.73s (0.22/sec, 23h remaining)
|
||||||
Jul 16 06:45:46 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 70218.75s (0.22/sec, 24h remaining)
|
Jul 16 06:45:46 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 70218.75s (0.22/sec, 24h remaining)
|
||||||
Jul 16 06:50:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 70518.85s (0.22/sec, 24h remaining)
|
Jul 16 06:50:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 70518.85s (0.22/sec, 24h remaining)
|
||||||
Jul 16 06:55:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 70818.95s (0.22/sec, 25h remaining)
|
Jul 16 06:55:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 70818.95s (0.22/sec, 25h remaining)
|
||||||
Jul 16 07:00:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 71119.05s (0.22/sec, 25h remaining)
|
Jul 16 07:00:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 71119.05s (0.22/sec, 25h remaining)
|
||||||
Jul 16 07:05:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 71419.15s (0.21/sec, 26h remaining)
|
Jul 16 07:05:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 71419.15s (0.21/sec, 26h remaining)
|
||||||
Jul 16 07:10:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 71719.25s (0.21/sec, 26h remaining)
|
Jul 16 07:10:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 71719.25s (0.21/sec, 26h remaining)
|
||||||
Jul 16 07:15:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 72019.27s (0.21/sec, 27h remaining)
|
Jul 16 07:15:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72019.27s (0.21/sec, 27h remaining)
|
||||||
Jul 16 07:20:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 72319.37s (0.21/sec, 27h remaining)
|
Jul 16 07:20:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72319.37s (0.21/sec, 27h remaining)
|
||||||
Jul 16 07:25:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 72619.47s (0.21/sec, 28h remaining)
|
Jul 16 07:25:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72619.47s (0.21/sec, 28h remaining)
|
||||||
Jul 16 07:30:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 72919.56s (0.21/sec, 28h remaining)
|
Jul 16 07:30:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 72919.56s (0.21/sec, 28h remaining)
|
||||||
Jul 16 07:35:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 73219.67s (0.21/sec, 29h remaining)
|
Jul 16 07:35:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 73219.67s (0.21/sec, 29h remaining)
|
||||||
Jul 16 07:40:47 <redacted> object-replicator 15348/16450 (93.30%) partitions replicated in 73519.76s (0.21/sec, 29h remaining)
|
Jul 16 07:40:47 192.168.245.4 object-replicator 15348/16450 (93.30%) partitions replicated in 73519.76s (0.21/sec, 29h remaining)
|
||||||
|
|
||||||
The above status is output every 5 minutes to ``/var/log/swift/background.log``.
|
The above status is output every 5 minutes to ``/var/log/swift/background.log``.
|
||||||
|
|
||||||
@ -921,7 +915,7 @@ of a corrupted filesystem detected by the object replicator:
|
|||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
# sudo bzgrep "Remote I/O error" /var/log/swift/background.log* |grep srv | - tail -1
|
# sudo bzgrep "Remote I/O error" /var/log/swift/background.log* |grep srv | - tail -1
|
||||||
Jul 12 03:33:30 <redacted> object-replicator STDOUT: ERROR:root:Error hashing suffix#012Traceback (most recent call last):#012 File
|
Jul 12 03:33:30 192.168.245.4 object-replicator STDOUT: ERROR:root:Error hashing suffix#012Traceback (most recent call last):#012 File
|
||||||
"/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 199, in get_hashes#012 hashes[suffix] = hash_suffix(suffix_dir,
|
"/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 199, in get_hashes#012 hashes[suffix] = hash_suffix(suffix_dir,
|
||||||
reclaim_age)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 84, in hash_suffix#012 path_contents =
|
reclaim_age)#012 File "/usr/lib/python2.7/dist-packages/swift/obj/replicator.py", line 84, in hash_suffix#012 path_contents =
|
||||||
sorted(os.listdir(path))#012OSError: [Errno 121] Remote I/O error: '/srv/node/disk4/objects/1643763/b51'
|
sorted(os.listdir(path))#012OSError: [Errno 121] Remote I/O error: '/srv/node/disk4/objects/1643763/b51'
|
||||||
@ -996,7 +990,7 @@ to repair the problem filesystem.
|
|||||||
# sudo xfs_repair -P /dev/sde1
|
# sudo xfs_repair -P /dev/sde1
|
||||||
|
|
||||||
#. If the ``xfs_repair`` fails then it may be necessary to re-format the
|
#. If the ``xfs_repair`` fails then it may be necessary to re-format the
|
||||||
filesystem. See Procedure: fix broken XFS filesystem. If the
|
filesystem. See :ref:`fix_broken_xfs_filesystem`. If the
|
||||||
``xfs_repair`` is successful, re-enable chef using the following command
|
``xfs_repair`` is successful, re-enable chef using the following command
|
||||||
and replication should commence again.
|
and replication should commence again.
|
||||||
|
|
||||||
@ -1025,7 +1019,183 @@ load:
|
|||||||
$ uptime
|
$ uptime
|
||||||
07:44:02 up 18:22, 1 user, load average: 407.12, 406.36, 404.59
|
07:44:02 up 18:22, 1 user, load average: 407.12, 406.36, 404.59
|
||||||
|
|
||||||
.. toctree::
|
Further issues and resolutions
|
||||||
:maxdepth: 2
|
------------------------------
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The urgency levels in each **Action** column indicate whether or
|
||||||
|
not it is required to take immediate action, or if the problem can be worked
|
||||||
|
on during business hours.
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:widths: 33 33 33
|
||||||
|
:header-rows: 1
|
||||||
|
|
||||||
|
* - **Scenario**
|
||||||
|
- **Description**
|
||||||
|
- **Action**
|
||||||
|
* - ``/healthcheck`` latency is high.
|
||||||
|
- The ``/healthcheck`` test does not tax the proxy very much, so any drop in value is probably related to
|
||||||
|
network issues, rather than the proxies being very busy. A very slow proxy might impact the average
|
||||||
|
number, but it would need to be very slow to shift the number that much.
|
||||||
|
- Check networks. Do a ``curl https://<ip-address>:<port>/healthcheck`` where
|
||||||
|
``ip-address`` is an individual proxy IP address.
|
||||||
|
Repeat this for every proxy server to see if you can pinpoint the problem.
|
||||||
|
|
||||||
|
Urgency: If there are other indications that your system is slow, you should treat
|
||||||
|
this as an urgent problem.
|
||||||
|
* - Swift process is not running.
|
||||||
|
- You can use ``swift-init`` status to check if swift processes are running on any
|
||||||
|
given server.
|
||||||
|
- Run this command:
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
sudo swift-init all start
|
||||||
|
|
||||||
|
Examine messages in the swift log files to see if there are any
|
||||||
|
error messages related to any of the swift processes since the time you
|
||||||
|
ran the ``swift-init`` command.
|
||||||
|
|
||||||
|
Take any corrective actions that seem necessary.
|
||||||
|
|
||||||
|
Urgency: If this only affects one server, and you have more than one,
|
||||||
|
identifying and fixing the problem can wait until business hours.
|
||||||
|
If this same problem affects many servers, then you need to take corrective
|
||||||
|
action immediately.
|
||||||
|
* - ntpd is not running.
|
||||||
|
- NTP is not running.
|
||||||
|
- Configure and start NTP.
|
||||||
|
|
||||||
|
Urgency: For proxy servers, this is vital.
|
||||||
|
|
||||||
|
* - Host clock is not syncd to an NTP server.
|
||||||
|
- Node time settings do not match NTP server time.
|
||||||
|
This may take some time to sync after a reboot.
|
||||||
|
- Assuming NTP is configured and running, you have to wait until the times sync.
|
||||||
|
* - A swift process has hundreds, to thousands of open file descriptors.
|
||||||
|
- May happen to any of the swift processes.
|
||||||
|
Known to have happened with an ``rsyslogd`` restart and where ``/tmp`` was hanging.
|
||||||
|
|
||||||
|
- Restart the swift processes on the affected node:
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
% sudo swift-init all reload
|
||||||
|
|
||||||
|
Urgency:
|
||||||
|
If known performance problem: Immediate
|
||||||
|
|
||||||
|
If system seems fine: Medium
|
||||||
|
* - A swift process is not owned by the swift user.
|
||||||
|
- If the UID of the swift user has changed, then the processes might not be
|
||||||
|
owned by that UID.
|
||||||
|
- Urgency: If this only affects one server, and you have more than one,
|
||||||
|
identifying and fixing the problem can wait until business hours.
|
||||||
|
If this same problem affects many servers, then you need to take corrective
|
||||||
|
action immediately.
|
||||||
|
* - Object account or container files not owned by swift.
|
||||||
|
- This typically happens if, during a reinstall or a re-image of a server, the UID
|
||||||
|
of the swift user was changed. The data files in the object account and container
|
||||||
|
directories are owned by the original swift UID. As a result, the current swift
|
||||||
|
user does not own these files.
|
||||||
|
- Correct the UID of the swift user to reflect that of the original UID. An alternate
|
||||||
|
action is to change the ownership of every file on all file systems. This alternate
|
||||||
|
action is often impractical and will take considerable time.
|
||||||
|
|
||||||
|
Urgency: If this only affects one server, and you have more than one,
|
||||||
|
identifying and fixing the problem can wait until business hours.
|
||||||
|
If this same problem affects many servers, then you need to take corrective
|
||||||
|
action immediately.
|
||||||
|
* - A disk drive has a high IO wait or service time.
|
||||||
|
- If high wait IO times are seen for a single disk, then the disk drive is the problem.
|
||||||
|
If most/all devices are slow, the controller is probably the source of the problem.
|
||||||
|
The controller cache may also be misconfigured, which will cause similarly long
|
||||||
|
wait or service times.
|
||||||
|
- As a first step, if your controllers have a cache, check that it is enabled and that the battery/capacitor
|
||||||
|
is working.
|
||||||
|
|
||||||
|
Second, reboot the server.
|
||||||
|
If problem persists, file a DC ticket to have the drive or controller replaced.
|
||||||
|
See :ref:`diagnose_slow_disk_drives` on how to check the drive wait or service times.
|
||||||
|
|
||||||
|
Urgency: Medium
|
||||||
|
* - The network interface is not up.
|
||||||
|
- Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
|
||||||
|
- You can try restarting the interface. However, generally the interface
|
||||||
|
(or cable) is probably broken, especially if the interface is flapping.
|
||||||
|
|
||||||
|
Urgency: If this only affects one server, and you have more than one,
|
||||||
|
identifying and fixing the problem can wait until business hours.
|
||||||
|
If this same problem affects many servers, then you need to take corrective
|
||||||
|
action immediately.
|
||||||
|
* - Network interface card (NIC) is not operating at the expected speed.
|
||||||
|
- The NIC is running at a slower speed than its nominal rated speed.
|
||||||
|
For example, it is running at 100 Mb/s and the NIC is a 1Ge NIC.
|
||||||
|
- 1. Try resetting the interface with:
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
sudo ethtool -s eth0 speed 1000
|
||||||
|
|
||||||
|
... and then run:
|
||||||
|
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
sudo lshw -class network
|
||||||
|
|
||||||
|
See if the speed goes to the expected value. Failing
|
||||||
|
that, check hardware (NIC cable/switch port).
|
||||||
|
|
||||||
|
2. If persistent, consider shutting down the server (especially if a proxy)
|
||||||
|
until the problem is identified and resolved. If you leave this server
|
||||||
|
running it can have a large impact on overall performance.
|
||||||
|
|
||||||
|
Urgency: High
|
||||||
|
* - The interface RX/TX error count is non-zero.
|
||||||
|
- A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
|
||||||
|
- 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers in the range
|
||||||
|
3-30 probably indicate that the error count has crept up slowly over a long time.
|
||||||
|
Consider rebooting the server to remove the report from the noise.
|
||||||
|
|
||||||
|
Typically, when a cable or interface is bad, the error count goes to 400+, so that
|
||||||
|
it stands out. There may be other symptoms such as the interface going up and down or
|
||||||
|
not running at correct speed. A server with a high error count should be watched.
|
||||||
|
|
||||||
|
2. If the error count continues to climb, consider taking the server down until
|
||||||
|
it can be properly investigated. In any case, a reboot should be done to clear
|
||||||
|
the error count.
|
||||||
|
|
||||||
|
Urgency: High, if the error count is increasing.
|
||||||
|
|
||||||
|
* - In a swift log you see a message that a process has not replicated in over 24 hours.
|
||||||
|
- The replicator has not successfully completed a run in the last 24 hours.
|
||||||
|
This indicates that the replicator has probably hung.
|
||||||
|
- Use ``swift-init`` to stop and then restart the replicator process.
|
||||||
|
|
||||||
|
Urgency: Low. However if you
|
||||||
|
recently added or replaced disk drives then you should treat this urgently.
|
||||||
|
* - Container Updater has not run in 4 hour(s).
|
||||||
|
- The service may appear to be running; however, it may be hung. Examine the swift
|
||||||
|
logs to see if there are any error messages relating to the container updater. This
|
||||||
|
may explain why the container updater is not running.
|
||||||
|
- Urgency: Medium
|
||||||
|
This may have been triggered by a recent restart of the rsyslog daemon.
|
||||||
|
Restart the service with:
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
sudo swift-init <service> reload
|
||||||
|
* - Object replicator: Reports the remaining time and that time is more than 100 hours.
|
||||||
|
- Each replication cycle the object replicator writes a log message to its log
|
||||||
|
reporting statistics about the current cycle. This includes an estimate for the
|
||||||
|
remaining time needed to replicate all objects. If this time is longer than
|
||||||
|
100 hours, there is a problem with the replication process.
|
||||||
|
- Urgency: Medium
|
||||||
|
Restart the service with:
|
||||||
|
.. code::
|
||||||
|
|
||||||
|
sudo swift-init object-replicator reload
|
||||||
|
|
||||||
|
Check that the remaining replication time is going down.
|
||||||
|
|
||||||
sec-furtherdiagnose.rst
|
|
||||||
|
@ -1,36 +0,0 @@
|
|||||||
==================
|
|
||||||
General Procedures
|
|
||||||
==================
|
|
||||||
|
|
||||||
Getting a swift account stats
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
.. note::
|
|
||||||
|
|
||||||
``swift-direct`` is specific to the HPE Helion Public Cloud. Go look at
|
|
||||||
``swifty`` for an alternate, this is an example.
|
|
||||||
|
|
||||||
This procedure describes how you determine the swift usage for a given
|
|
||||||
swift account, that is the number of containers, number of objects and
|
|
||||||
total bytes used. To do this you will need the project ID.
|
|
||||||
|
|
||||||
Log onto one of the swift proxy servers.
|
|
||||||
|
|
||||||
Use swift-direct to show this accounts usage:
|
|
||||||
|
|
||||||
.. code::
|
|
||||||
|
|
||||||
$ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8
|
|
||||||
Status: 200
|
|
||||||
Content-Length: 0
|
|
||||||
Accept-Ranges: bytes
|
|
||||||
X-Timestamp: 1379698586.88364
|
|
||||||
X-Account-Bytes-Used: 67440225625994
|
|
||||||
X-Account-Container-Count: 1
|
|
||||||
Content-Type: text/plain; charset=utf-8
|
|
||||||
X-Account-Object-Count: 8436776
|
|
||||||
Status: 200
|
|
||||||
name: my_container count: 8436776 bytes: 67440225625994
|
|
||||||
|
|
||||||
This account has 1 container. That container has 8436776 objects. The
|
|
||||||
total bytes used is 67440225625994.
|
|
@ -13,67 +13,15 @@ information, suggestions or recommendations. This document is provided
|
|||||||
for reference only. We are not responsible for your use of any
|
for reference only. We are not responsible for your use of any
|
||||||
information, suggestions or recommendations contained herein.
|
information, suggestions or recommendations contained herein.
|
||||||
|
|
||||||
This document also contains references to certain tools that we use to
|
|
||||||
operate the Swift system within the HPE Helion Public Cloud.
|
|
||||||
Descriptions of these tools are provided for reference only, as the tools themselves
|
|
||||||
are not publically available at this time.
|
|
||||||
|
|
||||||
- ``swift-direct``: This is similar to the ``swiftly`` tool.
|
|
||||||
|
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
|
|
||||||
general.rst
|
|
||||||
diagnose.rst
|
diagnose.rst
|
||||||
procedures.rst
|
procedures.rst
|
||||||
maintenance.rst
|
maintenance.rst
|
||||||
troubleshooting.rst
|
troubleshooting.rst
|
||||||
|
|
||||||
Is the system up?
|
|
||||||
~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
If you have a report that Swift is down, perform the following basic checks:
|
|
||||||
|
|
||||||
#. Run swift functional tests.
|
|
||||||
|
|
||||||
#. From a server in your data center, use ``curl`` to check ``/healthcheck``.
|
|
||||||
|
|
||||||
#. If you have a monitoring system, check your monitoring system.
|
|
||||||
|
|
||||||
#. Check on your hardware load balancers infrastructure.
|
|
||||||
|
|
||||||
#. Run swift-recon on a proxy node.
|
|
||||||
|
|
||||||
Run swift function tests
|
|
||||||
------------------------
|
|
||||||
|
|
||||||
We would recommend that you set up your function tests against your production
|
|
||||||
system.
|
|
||||||
|
|
||||||
A script for running the function tests is located in ``swift/.functests``.
|
|
||||||
|
|
||||||
|
|
||||||
External monitoring
|
|
||||||
-------------------
|
|
||||||
|
|
||||||
- We use pingdom.com to monitor the external Swift API. We suggest the
|
|
||||||
following:
|
|
||||||
|
|
||||||
- Do a GET on ``/healthcheck``
|
|
||||||
|
|
||||||
- Create a container, make it public (x-container-read:
|
|
||||||
.r\*,.rlistings), create a small file in the container; do a GET
|
|
||||||
on the object
|
|
||||||
|
|
||||||
Reference information
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
Reference: Swift startup/shutdown
|
|
||||||
---------------------------------
|
|
||||||
|
|
||||||
- Use reload - not stop/start/restart.
|
|
||||||
|
|
||||||
- Try to roll sets of servers (especially proxy) in groups of less
|
|
||||||
than 20% of your servers.
|
|
||||||
|
|
||||||
|
@ -54,8 +54,8 @@ system. Rules-of-thumb for 'good' recon output are:
|
|||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
\-> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
|
-> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
|
||||||
\-> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>
|
-> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>
|
||||||
|
|
||||||
- That could be okay or could require investigation.
|
- That could be okay or could require investigation.
|
||||||
|
|
||||||
@ -154,18 +154,18 @@ Running recon shows some async pendings:
|
|||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
bob@notso:~/swift-1.4.4/swift$ ssh \\-q <redacted>.132.7 sudo swift-recon \\-alr
|
bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
|
||||||
===============================================================================
|
===============================================================================
|
||||||
\[2012-03-14 17:25:55\\] Checking async pendings on 384 hosts...
|
[2012-03-14 17:25:55] Checking async pendings on 384 hosts...
|
||||||
Async stats: low: 0, high: 23, avg: 8, total: 3356
|
Async stats: low: 0, high: 23, avg: 8, total: 3356
|
||||||
===============================================================================
|
===============================================================================
|
||||||
\[2012-03-14 17:25:55\\] Checking replication times on 384 hosts...
|
[2012-03-14 17:25:55] Checking replication times on 384 hosts...
|
||||||
\[Replication Times\\] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
|
[Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
|
||||||
===============================================================================
|
===============================================================================
|
||||||
\[2012-03-14 17:25:56\\] Checking load avg's on 384 hosts...
|
[2012-03-14 17:25:56] Checking load avg's on 384 hosts...
|
||||||
\[5m load average\\] lowest: 2.35, highest: 8.88, avg: 4.45911458333
|
[5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
|
||||||
\[15m load average\\] lowest: 2.41, highest: 9.11, avg: 4.504765625
|
[15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
|
||||||
\[1m load average\\] lowest: 1.95, highest: 8.56, avg: 4.40588541667
|
[1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
|
||||||
===============================================================================
|
===============================================================================
|
||||||
|
|
||||||
Why? Running recon again with -av swift (not shown here) tells us that
|
Why? Running recon again with -av swift (not shown here) tells us that
|
||||||
@ -231,7 +231,7 @@ Procedure
|
|||||||
This procedure should be run three times, each time specifying the
|
This procedure should be run three times, each time specifying the
|
||||||
appropriate ``*.builder`` file.
|
appropriate ``*.builder`` file.
|
||||||
|
|
||||||
#. Determine whether all three nodes are different Swift zones by
|
#. Determine whether all three nodes are in different Swift zones by
|
||||||
running the ring builder on a proxy node to determine which zones
|
running the ring builder on a proxy node to determine which zones
|
||||||
the storage nodes are in. For example:
|
the storage nodes are in. For example:
|
||||||
|
|
||||||
@ -253,10 +253,10 @@ Procedure
|
|||||||
have any ring partitions in common; there is little/no data
|
have any ring partitions in common; there is little/no data
|
||||||
availability risk if all three nodes are down.
|
availability risk if all three nodes are down.
|
||||||
|
|
||||||
#. If the nodes are in three distinct Swift zonesit is necessary to
|
#. If the nodes are in three distinct Swift zones it is necessary to check
|
||||||
whether the nodes have ring partitions in common. Run ``swift-ring``
|
whether the nodes have ring partitions in common. Run ``swift-ring``
|
||||||
builder again, this time with the ``list_parts`` option and specify
|
builder again, this time with the ``list_parts`` option and specify
|
||||||
the nodes under consideration. For example (all on one line):
|
the nodes under consideration. For example:
|
||||||
|
|
||||||
.. code::
|
.. code::
|
||||||
|
|
||||||
@@ -302,12 +302,12 @@ Procedure

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common (about 0.001%), so the risk is small but nonzero.
   Recall that a partition is simply a portion of the ring mapping
   space, not actual data. So having partitions in common is a necessary
   but not sufficient condition for data unavailability.
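   As a quick sanity check of the percentage quoted above (a minimal sketch;
   any Python 3 interpreter on the proxy will do):

   .. code::

      $ python3 -c 'print(30 / 2097152 * 100)'
      0.001430511474609375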
@@ -320,3 +320,11 @@ Procedure

If three nodes that have 3 partitions in common are all down, there is
a nonzero probability that data are unavailable and we should work to
bring some or all of the nodes up ASAP.

Swift startup/shutdown
~~~~~~~~~~~~~~~~~~~~~~

- Use reload - not stop/start/restart.

- Try to roll sets of servers (especially proxy) in groups of less
  than 20% of your servers.
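For example, when rolling the proxy tier, a graceful reload on each node in the
current group is enough (a minimal sketch; substitute whichever service you are
rolling):

.. code::

   $ sudo swift-init proxy reload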
@@ -2,6 +2,8 @@
Software configuration procedures
=================================

.. _fix_broken_gpt_table:

Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -102,6 +104,8 @@ Fix broken GPT table (broken disk partition)

   $ sudo aptitude remove gdisk

.. _fix_broken_xfs_filesystem:

Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -165,7 +169,7 @@ Procedure: Fix broken XFS filesystem

.. code::

   $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
   1+0 records in
   1+0 records out
   1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
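After zeroing the start of the partition, the filesystem is normally recreated
and remounted before returning the device to service. A minimal sketch, assuming
``/dev/sdb2`` is the affected device and it has an entry in ``/etc/fstab`` (your
deployment's ``mkfs.xfs`` options may differ):

.. code::

   $ sudo mkfs.xfs -f /dev/sdb2
   $ sudo mount -a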
@@ -187,129 +191,173 @@ Procedure: Fix broken XFS filesystem

   $ mount

.. _checking_if_account_ok:

Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is only available in the HPE Helion Public Cloud.
   Use ``swiftly`` as an alternative (or use ``swift-get-nodes`` as explained
   here).

You must know the tenant/project ID. You can check whether the account is okay
as follows from a proxy.
.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id>

The response will either be similar to a swift list of the account's
containers, or an error indicating that the resource could not be found.

Alternatively, you can use ``swift-get-nodes`` to find the account database
files. Run the following on a proxy:

.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_<project-id>

The response will print ``curl`` and ``ssh`` commands that list the replicated
account databases. Use those commands to check the status and existence of
the account.
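As a sketch of what that check looks like, a ``HEAD`` request against one of
the account servers printed by ``swift-get-nodes`` (the IP, device and partition
below are placeholders) returns ``204 No Content`` when the account exists:

.. code::

   $ curl -I -XHEAD "http://<storage-ip>:6002/<device>/<partition>/AUTH_<project-id>"
   HTTP/1.1 204 No Content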
Procedure: Getting swift account stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is specific to the HPE Helion Public Cloud. Use
   ``swiftly`` as an alternative, or use ``swift-get-nodes`` as explained
   in :ref:`checking_if_account_ok`.

This procedure describes how to determine the Swift usage of a given
account, that is, the number of containers, the number of objects and the
total bytes used. To do this you will need the project ID.

Log onto one of the swift proxy servers.

Use ``swift-direct`` to show the account's usage:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_<project-id>
   Status: 200
   Content-Length: 0
   Accept-Ranges: bytes
   X-Timestamp: 1379698586.88364
   X-Account-Bytes-Used: 67440225625994
   X-Account-Container-Count: 1
   Content-Type: text/plain; charset=utf-8
   X-Account-Object-Count: 8436776
   Status: 200
   name: my_container count: 8436776 bytes: 67440225625994

This account has 1 container. That container has 8436776 objects, and the
total bytes used is 67440225625994.
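If you have the user's credentials exported as the usual ``OS_*`` environment
variables, the standard ``swift stat`` client command reports the same totals;
a sketch of the relevant lines of its output:

.. code::

   $ swift stat
      Containers: 1
         Objects: 8436776
           Bytes: 67440225625994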
Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift accounts are normally not recreated. If a tenant/project is deleted,
the account can then be deleted. If the user wishes to use Swift again,
the normal process is to create a new tenant/project -- and hence a
new Swift account.

However, if the Swift account is deleted but the tenant/project is not
deleted from Keystone, the user can no longer access the account. This
is because the account is marked deleted in Swift. You can revive
the account as described in this process.
.. note::

   The containers and objects in the "old" account cannot be listed
   anymore. In addition, if the Account Reaper process has not
   finished reaping the containers and objects in the "old" account, these
   are effectively orphaned and it is virtually impossible to find and delete
   them to free up disk space.

The solution is to delete the account database files and
re-create the account as follows:

#. You must know the tenant/project ID. The account name is AUTH_<project-id>.
   In this example, the tenant/project is ``4ebe3039674d4864a11fe0864ae4d905``,
   so the Swift account name is ``AUTH_4ebe3039674d4864a11fe0864ae4d905``.
#. Use ``swift-get-nodes`` to locate the account's database files (on three
   servers). The output below has been truncated so we can focus on the
   important pieces of data:

   .. code::

      $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_4ebe3039674d4864a11fe0864ae4d905
      ...
      curl -I -XHEAD "http://192.168.245.5:6002/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
      curl -I -XHEAD "http://192.168.245.3:6002/disk0/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
      curl -I -XHEAD "http://192.168.245.4:6002/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
      ...
      Use your own device location of servers:
      such as "export DEVICE=/srv/node"
      ssh 192.168.245.5 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
      ssh 192.168.245.3 "ls -lah ${DEVICE:-/srv/node*}/disk0/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
      ssh 192.168.245.4 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
      ...
      note: `/srv/node*` is used as default value of `devices`, the real value is set in the config file on each storage node.
#. Before proceeding, check that the account is really deleted by using curl.
   Execute the commands printed by ``swift-get-nodes``. For example:

   .. code::

      $ curl -I -XHEAD "http://192.168.245.5:6002/disk1/3934/AUTH_4ebe3039674d4864a11fe0864ae4d905"
      HTTP/1.1 404 Not Found
      Content-Length: 0
      Content-Type: text/html; charset=utf-8

   Repeat for the other two servers (192.168.245.3 and 192.168.245.4).
   A ``404 Not Found`` indicates that the account is deleted (or never existed).

   If you get a ``204 No Content`` response, do **not** proceed.
#. Use the ssh commands printed by ``swift-get-nodes`` to check whether database
   files exist. For example:

   .. code::

      $ ssh 192.168.245.5 "ls -lah ${DEVICE:-/srv/node*}/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052"
      total 20K
      drwxr-xr-x 2 swift swift 110 Mar 9 10:22 .
      drwxr-xr-x 3 swift swift 45 Mar 9 10:18 ..
      -rw------- 1 swift swift 17K Mar 9 10:22 f5ecf8b40de3e1b0adb0dbe576874052.db
      -rw-r--r-- 1 swift swift 0 Mar 9 10:22 f5ecf8b40de3e1b0adb0dbe576874052.db.pending
      -rwxr-xr-x 1 swift swift 0 Mar 9 10:18 .lock

   Repeat for the other two servers (192.168.245.3 and 192.168.245.4).

   If no files exist, no further action is needed.

#. Stop Swift processes on all nodes listed by ``swift-get-nodes``
   (in this example, that is 192.168.245.3, 192.168.245.4 and 192.168.245.5).

#. We recommend you make backup copies of the database files.

#. Delete the database files. For example:

   .. code::

      $ ssh 192.168.245.5
      $ cd /srv/node/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052
      $ sudo rm *

   Repeat for the other two servers (192.168.245.3 and 192.168.245.4).

#. Restart Swift on all three servers (a combined sketch of the stop, backup
   and restart steps appears after this procedure).

At this stage, the account is fully deleted. If you enable the auto-create option, the
next time the user attempts to access the account, the account will be created.
You may also use ``swiftly`` to recreate the account.
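The stop, backup and restart steps above are deliberately generic. A minimal
sketch for one of the storage nodes in this example (the backup location is
arbitrary, and the exact ``swift-init`` targets depend on which services run
on the node):

.. code::

   $ ssh 192.168.245.5
   $ sudo swift-init all stop
   $ sudo cp -a /srv/node/disk1/accounts/3934/052/f5ecf8b40de3e1b0adb0dbe576874052 /var/tmp/
   # ... delete the database files as shown in the procedure above ...
   $ sudo swift-init all start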
Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -319,7 +367,7 @@ follows. This can be useful when a proxy is misbehaving but you need

Swift running to help diagnose the problem. By removing it from the load
balancers, customers are not impacted by the misbehaving proxy.

#. Ensure that in /etc/swift/proxy-server.conf the ``disable_path`` variable is set to
   ``/etc/swift/disabled-by-file``.

#. Log onto the proxy node.
@@ -346,13 +394,10 @@ balancers, customers are not impacted by the misbehaving proxy.

   sudo swift-init proxy start

It works because the healthcheck middleware looks for ``/etc/swift/disabled-by-file``.
If that file exists, the middleware returns a 503 error instead of 200/OK, and the load
balancer should stop sending traffic to the proxy.
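A quick way to confirm from the node itself that the proxy is now reporting
itself unhealthy (a sketch; adjust the port to your proxy's ``bind_port``):

.. code::

   $ curl -i http://localhost:8080/healthcheck
   HTTP/1.1 503 Service Unavailable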
Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
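As a very rough, non-destructive check of a drive's sequential read rate
(a sketch; ``/dev/sdb`` is illustrative):

.. code::

   $ sudo dd if=/dev/sdb of=/dev/null bs=1M count=1024 iflag=direct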
@@ -1,177 +0,0 @@
==============================
Further issues and resolutions
==============================

.. note::

   The urgency levels in each **Action** column indicate whether or
   not it is required to take immediate action, or if the problem can be worked
   on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much, so any drop in value is
       probably related to network issues, rather than the proxies being very busy. A very
       slow proxy might impact the average number, but it would need to be very slow to shift
       the number that much.
     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where ``<ip-address>``
       is an individual proxy IP address, to see if you can pinpoint a problem in the network.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - A swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.
   * - Host clock is not synced to an NTP server.
     - The node's time settings do not match the NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with an ``rsyslogd`` restart and where ``/tmp`` was hanging.
     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate.
       If system seems fine: Medium.
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object, account or container files are not owned by swift.
     - This typically happens if, during a reinstall or a re-image of a server, the UID
       of the swift user was changed. The data files in the object, account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternative
       action is to change the ownership of every file on all file systems. This alternative
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and that
       its battery/capacitor is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See `Diagnose: Slow disk devices` on how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state.
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1Ge NIC.
     - 1. Try resetting the interface with:

          .. code::

             sudo ethtool -s eth0 speed 1000

          ... and then run:

          .. code::

             sudo lshw -class network

          See if the speed goes to the expected value. Failing
          that, check the hardware (NIC cable/switch port).

       2. If the problem persists, consider shutting down the server (especially if a proxy)
          until the problem is identified and resolved. If you leave this server
          running it can have a large impact on overall performance.

       Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers in the
          range 3-30 probably indicate that the error count has crept up slowly over a long
          time. Consider rebooting the server to remove the report from the noise.

          Typically, when a cable or interface is bad, the error count goes to 400+ so that
          it stands out. There may be other symptoms such as the interface going up and down
          or not running at the correct speed. A server with a high error count should be
          watched.

       2. If the error count continues to climb, consider taking the server down until
          it can be properly investigated. In any case, a reboot should be done to clear
          the error count.

       Urgency: High, if the error count is increasing.
   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low; however, if you recently added or replaced disk drives
       then you should treat this urgently.
   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater. This
       may potentially explain why the container updater is not running.
     - Urgency: Medium

       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium

       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.
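Several of the checks in the table above lean on standard tools; a quick sketch
of the process-status and interface checks (assuming the data-plane interface
is ``eth0``):

.. code::

   $ sudo swift-init all status
   $ ifconfig eth0
   $ sudo ethtool eth0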
@@ -18,16 +18,14 @@ files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.132.6
   ----------------
   Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132
   <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af
   /%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - -
@@ -37,39 +35,36 @@ This shows a ``GET`` operation on the user's account.

.. note::

   The HTTP status returned is 404, Not Found, rather than 500 as reported by the user.

Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
search the swift object servers' log files for this transaction ID:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.72.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.[4-67|4-67],<redacted>.204.[4-131] \
     'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.72.16
   ----------------
   Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-"
   0.0016 ""
   ----------------
   <redacted>.31
   ----------------
   Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-
   4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
   ----------------
   <redacted>.204.70
   ----------------
   Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - -
   [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-
@@ -79,10 +74,10 @@ search the swift object servers' log files for this transaction ID:

These are the 3 GET operations to the 3 different object servers that hold the 3
replicas of this user's account. Each ``GET`` returns an HTTP status of 404,
Not Found.

Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:

.. code::
@@ -114,23 +109,23 @@ user's account data is stored:

   curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/" # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for
this user's account. For example, on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift 98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift 45 2012-02-03 23:28 ..
   -rw------- 1 swift swift 15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift 0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this user's account db, an SQLite db, is present. Use sqlite to
@@ -155,7 +150,7 @@ check out the account:

   status_changed_at = 1330001026.00514
   metadata =

.. note::

   The status is ``DELETED``. So this account was deleted. This explains
   why the GET operations are returning 404, Not Found. Check the account
@@ -174,14 +169,14 @@ server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139 4-11,132-139],<redacted>.132.[4-11,132-139|4-11,132-139] \
     'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* \
     | grep -w DELETE | awk "{print \$3,\$10,\$12}"' | dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server <redacted> <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-
   52ddda82a5af/ HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - -
   tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being deleted.
@@ -252,8 +247,8 @@ Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (for example, where they are being
re-purposed), it is very important to follow these steps.

#. In the case of object servers, follow the procedure for removing
   the node from the rings (see the sketch below).
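That ring-removal step is usually a matter of removing the node's devices from
each builder and rebalancing. A minimal sketch for the object ring (the IP
address is illustrative; repeat for the account and container builders, then
redistribute the updated ``*.ring.gz`` files):

.. code::

   $ sudo swift-ring-builder /etc/swift/object.builder remove 192.168.245.99
   $ sudo swift-ring-builder /etc/swift/object.builder rebalance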
|
Loading…
x
Reference in New Issue
Block a user