
Further issues and resolutions
Note
The urgency level in each Action column indicates whether you need to take immediate action or whether the problem can be worked on during business hours.
Scenario | Description | Action |
---|---|---|
`/healthcheck` latency is high. | The `/healthcheck` request does not tax the proxy very much, so a drop in this value is more likely caused by network issues than by busy proxies. | Check networks. Do a `curl http://<proxy-ip>:<port>/healthcheck` against each individual proxy IP address (see the curl example after this table). Urgency: If there are other indications that your system is slow, you should treat this as an urgent problem. |
Swift process is not running. | You can use `swift-init` status to check whether swift processes are running on a given server. | Run `sudo swift-init all start` (see the process check sketch after this table). Examine messages in the swift log files to see if there are any error messages related to any of the swift processes since the time you ran the `swift-init` command. Take any corrective actions that seem necessary. Urgency: If this only affects one server, and you have more than one, identifying and fixing the problem can wait until business hours. If this same problem affects many servers, then you need to take corrective action immediately. |
ntpd is not running. | NTP is not running. | Configure and start NTP. Urgency: For proxy servers, this is vital. |
Host clock is not synced to an NTP server. | The node's time does not match the NTP server time. This may take some time to sync after a reboot. | Assuming NTP is configured and running, you have to wait until the times sync (see the NTP check sketch after this table). |
A swift process has hundreds to thousands of open file descriptors. | May happen to any of the swift processes. Known to have happened after a restart of the rsyslog daemon (see the file-descriptor check after this table). | Restart the swift processes on the affected node, for example with `sudo swift-init all reload`. |
A swift process is not owned by the swift user. | If the UID of the swift user has changed, then the processes might not be owned by that UID. | Urgency: If this only affects one server, and you have more than one, identifying and fixing the problem can wait until business hours. If this same problem affects many servers, then you need to take corrective action immediately. |
Object, account, or container files not owned by swift. | This typically happens if the UID of the swift user was changed during a reinstall or re-image of a server. The data files in the object, account, and container directories are still owned by the original swift UID, so the current swift user does not own these files. | Correct the UID of the swift user to match the original UID (see the ownership check sketch after this table). An alternative is to change the ownership of every file on all file systems; this is often impractical and will take considerable time. Urgency: If this only affects one server, and you have more than one, identifying and fixing the problem can wait until business hours. If this same problem affects many servers, then you need to take corrective action immediately. |
A disk drive has a high IO wait or service time. | If high IO wait times are seen for a single disk, then that disk drive is the problem. If most or all devices are slow, the controller is probably the source of the problem. A misconfigured controller cache can also cause similarly long wait or service times. | As a first step, if your controllers have a cache, check that it is enabled and that its battery/capacitor is working. Second, reboot the server. If the problem persists, file a DC ticket to have the drive or controller replaced. See Diagnose: Slow disk devices on how to check the drive wait or service times (and the iostat sketch after this table). Urgency: Medium |
The network interface is not up. | Use `ip link` (or `ifconfig`) to confirm that the interface is down (see the network check sketch after this table). | You can try restarting the interface. However, the interface (or cable) is generally broken, especially if the interface is flapping. Urgency: If this only affects one server, and you have more than one, identifying and fixing the problem can wait until business hours. If this same problem affects many servers, then you need to take corrective action immediately. |
Network interface card (NIC) is not operating at the expected speed. | The NIC is running at a slower speed than its nominal rated speed, for example at 100 Mb/s when the NIC is a 1 GbE NIC. | Try renegotiating or forcing the link speed with `ethtool -s`, then check the reported speed (for example with `ethtool <interface>`) and see if it goes to the expected value. Failing that, check the hardware (NIC, cable, switch port). Urgency: High |
The interface RX/TX error count is non-zero. | A value of 0 is typical, but counts of 1 or 2 do not indicate a problem. | If the counts are continually increasing, check the hardware (NIC, cable, switch port); see the network check sketch after this table. |
In a swift log you see a message that a process has not replicated in over 24 hours. | The replicator has not successfully completed a run in the last 24 hours. This indicates that the replicator has probably hung. | Use `swift-init` to restart the replicator on the affected node (see the process check sketch after this table). Urgency: Low; however, if you recently added or replaced disk drives, then you should treat this urgently. |
Container Updater has not run in 4 hour(s). | The service may appear to be running; however, it may be hung. Examine the swift logs to see if there are any error messages relating to the container updater, which may explain why it is not running. | This may have been triggered by a recent restart of the rsyslog daemon. Restart the service, for example with `sudo swift-init container-updater restart`. Urgency: Medium |
Object replicator: Reports the remaining time and that time is more than 100 hours. | Each replication cycle, the object replicator writes a log message reporting statistics about the current cycle, including an estimate of the time needed to replicate all objects. If this time is longer than 100 hours, there is a problem with the replication process. | Restart the service, for example with `sudo swift-init object-replicator restart`, and check that the remaining replication time is going down. Urgency: Medium |
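
For the `/healthcheck` latency row, a minimal check is to query each proxy individually and time the response. The IP address and port below are placeholders; substitute your own proxy addresses and the port your proxies listen on.

.. code::

   # Query one proxy directly and report the HTTP status and total request time.
   # 192.0.2.11 and 8080 are examples only.
   curl -s -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
        http://192.0.2.11:8080/healthcheck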
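Several rows above involve checking and restarting swift services with `swift-init`. The following is a minimal sketch; it assumes you have sudo on the affected node and that swift logs are written under `/var/log/swift` (adjust the path to match your syslog configuration).

.. code::

   # Show which swift services are running on this node.
   sudo swift-init all status

   # Start anything that is not running.
   sudo swift-init all start

   # Or restart a single hung service, e.g. the container updater.
   sudo swift-init container-updater restart

   # Then look for related errors in the swift logs.
   sudo grep -i error /var/log/swift/*.log | tail -n 50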
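For the ntpd and clock-sync rows, one quick way to confirm that time is being synchronized is shown below. It assumes classic ntpd; on systemd-based hosts `timedatectl` gives similar information.

.. code::

   # Is ntpd running at all?
   pgrep -a ntpd || echo "ntpd is not running"

   # Show the peers ntpd is syncing against; a '*' marks the selected server.
   ntpq -p

   # On systemd-based hosts, this reports whether the clock is synchronized.
   timedatectl status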
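To see whether a swift process really has hundreds to thousands of open file descriptors, you can count the entries under `/proc/<pid>/fd`. A sketch that walks every swift process on the node:

.. code::

   # Print pid, process name, and open file descriptor count for each swift process.
   for pid in $(pgrep -f swift-); do
       printf "%-8s %-30s %s\n" "$pid" \
           "$(ps -p "$pid" -o comm=)" \
           "$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)"
   done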
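For the UID and ownership rows, the sketch below compares the current swift UID with the ownership of processes and data files. It assumes the data file systems are mounted under `/srv/node`, which is a common default; adjust to your layout.

.. code::

   # Current UID of the swift user.
   id swift

   # Any swift processes not running as the swift user?
   ps -eo user,pid,args | grep '[s]wift-' | grep -v '^swift '

   # Any data files not owned by swift? (Can take a while on large disks.)
   sudo find /srv/node -not -user swift -ls | head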
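For the slow-disk row, `iostat` from the sysstat package shows per-device wait times and utilization. A sketch using 5-second samples (the first report is the average since boot):

.. code::

   # -x gives extended statistics, including await (average wait time in ms)
   # and %util; a single device with much higher await than its peers is suspect.
   iostat -x 5 3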
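For the three network rows, `ip` and `ethtool` cover link state, negotiated speed, and RX/TX error counters. The interface name `eth0` is only an example; substitute your data-plane interface.

.. code::

   # Link state (UP/DOWN) and RX/TX error counters.
   ip -s link show eth0

   # Negotiated speed and duplex; compare against the NIC's rated speed.
   sudo ethtool eth0

   # If the interface is down, you can try bouncing it.
   sudo ip link set eth0 down && sudo ip link set eth0 up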