<?xml version="1.0" encoding="UTF-8"?>
|
|
<chapter version="5.0" xml:id="maintenance"
|
|
xmlns="http://docbook.org/ns/docbook"
|
|
xmlns:xlink="http://www.w3.org/1999/xlink"
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
xmlns:ns5="http://www.w3.org/2000/svg"
|
|
xmlns:ns4="http://www.w3.org/1998/Math/MathML"
|
|
xmlns:ns3="http://www.w3.org/1999/xhtml"
|
|
xmlns:db="http://docbook.org/ns/docbook">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Maintenance, Failures, and Debugging</title>
|
|
|
|
<para>Downtime, whether planned or unscheduled, is a certainty when running
|
|
a cloud. This chapter aims to provide useful information for dealing
|
|
proactively, or reactively, with these occurrences.<indexterm
|
|
class="startofrange" xml:id="maindebug">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<seealso>troubleshooting</seealso>
|
|
</indexterm></para>
|
|
|
|
<section xml:id="cloud_controller_storage">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Cloud Controller and Storage Proxy Failures and Maintenance</title>
|
|
|
|
<para>The cloud controller and storage proxy are very similar to each
|
|
other when it comes to expected and unexpected downtime. One of each
|
|
server type typically runs in the cloud, which makes them very noticeable
|
|
when they are not running.</para>
|
|
|
|
<para>For the cloud controller, the good news is that if your cloud is using
the FlatDHCP multi-host HA network mode, existing instances and volumes
continue to operate while the cloud controller is offline. For the storage
proxy, however, no storage traffic is possible until it is back up and
running.</para>
|
|
|
|
<section xml:id="planned_maintenance">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Planned Maintenance</title>
|
|
|
|
<para>One way to plan for cloud controller or storage proxy maintenance
is simply to do it off-hours, such as at 1 or 2 a.m. This strategy
affects fewer users. If your cloud controller or storage proxy is too
important to be unavailable at any time, you must look into
high-availability options.<indexterm class="singular">
|
|
<primary>cloud controllers</primary>
|
|
|
|
<secondary>planned maintenance of</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>cloud controller planned maintenance</secondary>
|
|
</indexterm></para>
|
|
</section>
|
|
|
|
<section xml:id="reboot_cloud_controller">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Rebooting a Cloud Controller or Storage Proxy</title>
|
|
|
|
<para>Simply issue the <command>reboot</command> command. The operating system
cleanly shuts down services and then automatically reboots. If you want
to be very thorough, run your backup jobs just before you
reboot.<indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>rebooting following</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>storage</primary>
|
|
|
|
<secondary>storage proxy maintenance</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>reboot</primary>
|
|
|
|
<secondary>cloud controller or storage proxy</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>cloud controllers</primary>
|
|
|
|
<secondary>rebooting</secondary>
|
|
</indexterm></para>
|
|
</section>
|
|
|
|
<section xml:id="after_a_cc_reboot">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>After a Cloud Controller or Storage Proxy Reboots</title>
|
|
|
|
<para>After a cloud controller reboots, ensure that all required
|
|
services were successfully started. The following commands use
|
|
<code>ps</code> and <code>grep</code> to determine if nova, glance, and
|
|
keystone are currently running:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># ps aux | grep nova-
|
|
# ps aux | grep glance-
|
|
# ps aux | grep keystone
|
|
# ps aux | grep cinder</programlisting>
|
|
|
|
<para>Also check that all services are functioning. The following set of
|
|
commands sources the <code>openrc</code> file, then runs some basic
|
|
glance, nova, and openstack commands. If the commands work as expected,
|
|
you can be confident that those services are in working
|
|
condition:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># source openrc
|
|
# glance index
|
|
# nova list
|
|
# openstack project list</programlisting>
|
|
|
|
<para>For the storage proxy, ensure that the Object Storage service has
|
|
resumed:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># ps aux | grep swift</programlisting>
|
|
|
|
<para>Also check that it is functioning:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># swift stat</programlisting>
|
|
</section>
|
|
|
|
<section xml:id="cc_failure">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Total Cloud Controller Failure</title>
|
|
|
|
<para>The cloud controller could completely fail if, for example, its
|
|
motherboard goes bad. Users will immediately notice the loss of a cloud
|
|
controller since it provides core functionality to your cloud
|
|
environment. If your infrastructure monitoring does not alert you that
|
|
your cloud controller has failed, your users definitely will.
|
|
Unfortunately, this is a rough situation. The cloud controller is an
|
|
integral part of your cloud. If you have only one controller, you will
|
|
have many missing services if it goes down.<indexterm class="singular">
|
|
<primary>cloud controllers</primary>
|
|
|
|
<secondary>total failure of</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>cloud controller total failure</secondary>
|
|
</indexterm></para>
|
|
|
|
<para>To avoid this situation, create a highly available cloud
|
|
controller cluster. This is outside the scope of this document, but you
|
|
can read more in the <link
|
|
xlink:href="http://docs.openstack.org/ha-guide/index.html">OpenStack High Availability
|
|
Guide</link>.</para>
|
|
|
|
<para>The next best approach is to use a configuration-management tool,
|
|
such as Puppet, to automatically build a cloud controller. This should
|
|
not take more than 15 minutes if you have a spare server available.
|
|
After the controller rebuilds, restore any backups taken (see <xref
|
|
linkend="backup_and_recovery" />).</para>
|
|
|
|
<para>Also, in practice, the <literal>nova-compute</literal> services on
the compute nodes do not always reconnect cleanly to RabbitMQ hosted on
the controller when it comes back up after a long reboot; a restart of
the nova services on the compute nodes is then required, as shown
below.</para>
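<para>A minimal sketch of such a restart, assuming your compute nodes run
upstart-managed services as in the other examples in this chapter (adjust
the command to your init system and hostnames):</para>

<programlisting><?db-font-size 65%?># for i in c01.example.com c02.example.com c03.example.com
> do
> ssh $i restart nova-compute
> done</programlisting>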
</section>
|
|
</section>
|
|
|
|
<section xml:id="compute_node_failures">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Compute Node Failures and Maintenance</title>
|
|
|
|
<para>Sometimes a compute node either crashes unexpectedly or requires a
|
|
reboot for maintenance reasons.</para>
|
|
|
|
<section xml:id="planned_maintenance_compute_node">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Planned Maintenance</title>
|
|
|
|
<para>If you need to reboot a compute node due to planned maintenance
|
|
(such as a software or hardware upgrade), first ensure that all hosted
|
|
instances have been moved off the node. If your cloud is utilizing
|
|
shared storage, use the <code>nova live-migration</code> command. First,
|
|
get a list of instances that need to be moved:<indexterm
|
|
class="singular">
|
|
<primary>compute nodes</primary>
|
|
|
|
<secondary>maintenance</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>compute node planned maintenance</secondary>
|
|
</indexterm></para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova list --host c01.example.com --all-tenants</programlisting>
|
|
|
|
<para>Next, migrate them one by one:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova live-migration <uuid> c02.example.com</programlisting>
|
|
|
|
<para>If you are not using shared storage, you can use the
|
|
<code>--block-migrate</code> option:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova live-migration --block-migrate <uuid> c02.example.com</programlisting>
|
|
|
|
<para>After you have migrated all instances, ensure that the
|
|
<code>nova-compute</code> service has <phrase
|
|
role="keep-together">stopped</phrase>:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># stop nova-compute</programlisting>
|
|
|
|
<para>If you use a configuration-management system, such as Puppet, that
|
|
ensures the <code>nova-compute</code> service is always running, you can
|
|
temporarily move the <literal>init</literal> files:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># mkdir /root/tmp
|
|
# mv /etc/init/nova-compute.conf /root/tmp
|
|
# mv /etc/init.d/nova-compute /root/tmp</programlisting>
|
|
|
|
<para>Next, shut down your compute node, perform your maintenance, and
|
|
turn the node back on. You can reenable the <code>nova-compute</code>
|
|
service by undoing the previous commands:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># mv /root/tmp/nova-compute.conf /etc/init
|
|
# mv /root/tmp/nova-compute /etc/init.d/</programlisting>
|
|
|
|
<para>Then start the <code>nova-compute</code> service:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># start nova-compute</programlisting>
|
|
|
|
<para>You can now optionally migrate the instances back to their
|
|
original compute node.</para>
|
|
</section>
|
|
|
|
<section xml:id="after_compute_node_reboot">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>After a Compute Node Reboots</title>
|
|
|
|
<para>When you reboot a compute node, first verify that it booted
|
|
successfully. This includes ensuring that the <code>nova-compute</code>
|
|
service is running:<indexterm class="singular">
|
|
<primary>reboot</primary>
|
|
|
|
<secondary>compute node</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>compute node reboot</secondary>
|
|
</indexterm></para>
|
|
|
|
<programlisting><?db-font-size 65%?># ps aux | grep nova-compute
|
|
# status nova-compute</programlisting>
|
|
|
|
<para>Also ensure that it has successfully connected to the AMQP
server:</para>

<programlisting><?db-font-size 65%?># grep AMQP /var/log/nova/nova-compute.log
2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672</programlisting>
|
|
|
|
<para>After the compute node is successfully running, you must deal with
|
|
the instances that are hosted on that compute node because none of them
|
|
are running. Depending on your SLA with your users or customers, you
|
|
might have to start each instance and ensure that they start
|
|
correctly.</para>
|
|
</section>
|
|
|
|
<section xml:id="maintenance_instances">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Instances</title>
|
|
|
|
<para>You can create a list of the instances hosted on the compute
node by running the following command:<indexterm class="singular">
|
|
<primary>instances</primary>
|
|
|
|
<secondary>maintenance/debugging</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>instances</secondary>
|
|
</indexterm></para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova list --host c01.example.com --all-tenants</programlisting>
|
|
|
|
<para>After you have the list, you can use the nova command to start
|
|
each instance:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova reboot <uuid></programlisting>
|
|
|
|
<note>
|
|
<para>Any time an instance shuts down unexpectedly, it might have
|
|
problems on boot. For example, the instance might require an
|
|
<code>fsck</code> on the root partition. If this happens, the user can
|
|
use the dashboard VNC console to fix this.</para>
|
|
</note>
|
|
|
|
<para>If an instance does not boot, meaning <code>virsh list</code>
|
|
never shows the instance as even attempting to boot, do the following on
|
|
the compute node:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># tail -f /var/log/nova/nova-compute.log</programlisting>
|
|
|
|
<para>Try executing the <code>nova reboot</code> command again. You
should see an error message describing why the instance was not able to
boot.</para>
|
|
|
|
<para>In most cases, the error is the result of something in libvirt's
|
|
XML file (<code>/etc/libvirt/qemu/instance-xxxxxxxx.xml</code>) that no
|
|
longer exists. You can enforce re-creation of the XML file as well as
|
|
rebooting the instance by running the following command:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova reboot --hard <uuid></programlisting>
|
|
</section>
|
|
|
|
<section xml:id="inspect_and_recover_failed_instances">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Inspecting and Recovering Data from Failed Instances</title>
|
|
|
|
<para>In some scenarios, instances are running but are inaccessible
through SSH and do not respond to any command. The VNC console could be
displaying boot failure or kernel panic error messages. This could be
an indication of file system corruption on the VM itself. If you need to
recover files or inspect the content of the instance, qemu-nbd can be
used to mount the disk.<indexterm class="singular">
|
|
<primary>data</primary>
|
|
|
|
<secondary>inspecting/recovering failed instances</secondary>
|
|
</indexterm></para>
|
|
|
|
<warning>
|
|
<para>If you access or view the user's content and data, get approval
|
|
first!<indexterm class="singular">
|
|
<primary>security issues</primary>
|
|
|
|
<secondary>failed instance data inspection</secondary>
|
|
</indexterm></para>
|
|
</warning>
|
|
|
|
<para>To access the instance's disk
|
|
(<literal>/var/lib/nova/instances/instance-<replaceable>xxxxxx</replaceable>/disk</literal>),
|
|
use the following steps:</para>
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>Suspend the instance using the <literal>virsh</literal>
|
|
command.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Connect the qemu-nbd device to the disk.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Mount the qemu-nbd device.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Unmount the device after inspecting.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Disconnect the qemu-nbd device.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Resume the instance.</para>
|
|
</listitem>
|
|
</orderedlist>
|
|
|
|
<para>If you do not follow steps 4 through 6, OpenStack Compute cannot
|
|
manage the instance any longer. It fails to respond to any command
|
|
issued by OpenStack Compute, and it is marked as shut down.</para>
|
|
|
|
<para>Once you mount the disk file, you should be able to access it and
|
|
treat it as a collection of normal directories with files and a
|
|
directory structure. However, we do not recommend that you edit or touch
|
|
any files because this could change the access control lists (ACLs) that
|
|
are used to determine which accounts can perform what operations on
|
|
files and directories. Changing ACLs can make the instance unbootable if
|
|
it is not already.<indexterm class="singular">
|
|
<primary>access control list (ACL)</primary>
|
|
</indexterm></para>
|
|
|
|
<orderedlist>
|
|
<listitem>
|
|
<para>Suspend the instance using the <literal>virsh</literal>
|
|
command, taking note of the internal ID:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># virsh list
|
|
Id Name State
|
|
----------------------------------
|
|
1 instance-00000981 running
|
|
2 instance-000009f5 running
|
|
30 instance-0000274a running
|
|
|
|
# virsh suspend 30
|
|
Domain 30 suspended</programlisting>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Connect the qemu-nbd device to the disk:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># cd /var/lib/nova/instances/instance-0000274a
|
|
# ls -lh
|
|
total 33M
|
|
-rw-rw---- 1 libvirt-qemu kvm 6.3K Oct 15 11:31 console.log
|
|
-rw-r--r-- 1 libvirt-qemu kvm 33M Oct 15 22:06 disk
|
|
-rw-r--r-- 1 libvirt-qemu kvm 384K Oct 15 22:06 disk.local
|
|
-rw-rw-r-- 1 nova nova 1.7K Oct 15 11:30 libvirt.xml
|
|
# qemu-nbd -c /dev/nbd0 `pwd`/disk</programlisting>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Mount the qemu-nbd device.</para>
|
|
|
|
<para>The qemu-nbd device tries to export the instance disk's
|
|
different partitions as separate devices. For example, if vda is the
|
|
disk and vda1 is the root partition, qemu-nbd exports the device as
|
|
<literal>/dev/nbd0</literal> and <literal>/dev/nbd0p1</literal>,
|
|
respectively:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># mount /dev/nbd0p1 /mnt/</programlisting>
|
|
|
|
<para>You can now access the contents of <code>/mnt</code>, which
|
|
correspond to the first partition of the instance's disk.</para>
|
|
|
|
<para>To examine the secondary or ephemeral disk, use an alternate
|
|
mount point if you want both primary and secondary drives mounted at
|
|
the same time:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># umount /mnt
|
|
# qemu-nbd -c /dev/nbd1 `pwd`/disk.local
|
|
# mount /dev/nbd1 /mnt/</programlisting>
|
|
|
|
<programlisting><?db-font-size 65%?># ls -lh /mnt/
|
|
total 76K
|
|
lrwxrwxrwx. 1 root root 7 Oct 15 00:44 bin -> usr/bin
|
|
dr-xr-xr-x. 4 root root 4.0K Oct 15 01:07 boot
|
|
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 dev
|
|
drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
|
|
drwxr-xr-x. 3 root root 4.0K Oct 15 01:07 home
|
|
lrwxrwxrwx. 1 root root 7 Oct 15 00:44 lib -> usr/lib
|
|
lrwxrwxrwx. 1 root root 9 Oct 15 00:44 lib64 -> usr/lib64
|
|
drwx------. 2 root root 16K Oct 15 00:42 lost+found
|
|
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 media
|
|
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 mnt
|
|
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 opt
|
|
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 proc
|
|
dr-xr-x---. 3 root root 4.0K Oct 15 21:56 root
|
|
drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
|
|
lrwxrwxrwx. 1 root root 8 Oct 15 00:44 sbin -> usr/sbin
|
|
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 srv
|
|
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 sys
|
|
drwxrwxrwt. 9 root root 4.0K Oct 15 16:29 tmp
|
|
drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
|
|
drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var</programlisting>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Once you have completed the inspection, unmount the mount
|
|
point and release the qemu-nbd device:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># umount /mnt
|
|
# qemu-nbd -d /dev/nbd0
|
|
/dev/nbd0 disconnected</programlisting>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Resume the instance using <literal>virsh</literal>:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># virsh list
|
|
Id Name State
|
|
----------------------------------
|
|
1 instance-00000981 running
|
|
2 instance-000009f5 running
|
|
30 instance-0000274a paused
|
|
|
|
# virsh resume 30
|
|
Domain 30 resumed</programlisting>
|
|
</listitem>
|
|
</orderedlist>
|
|
</section>
|
|
|
|
<section xml:id="volumes">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Volumes</title>
|
|
|
|
<para>If the affected instances also had attached volumes, first
|
|
generate a list of instance and volume UUIDs:<indexterm class="singular">
|
|
<primary>volume</primary>
|
|
|
|
<secondary>maintenance/debugging</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>volumes</secondary>
|
|
</indexterm></para>
|
|
|
|
<programlisting><?db-font-size 65%?>mysql> select nova.instances.uuid as instance_uuid,
|
|
cinder.volumes.id as volume_uuid, cinder.volumes.status,
|
|
cinder.volumes.attach_status, cinder.volumes.mountpoint,
|
|
cinder.volumes.display_name from cinder.volumes
|
|
inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
|
|
where nova.instances.host = 'c01.example.com';</programlisting>
|
|
|
|
<para>You should see a result similar to the following:</para>
|
|
|
|
<programlisting><?db-font-size 55%?>
|
|
+--------------+------------+-------+--------------+-----------+--------------+
|
|
|instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
|
|
+--------------+------------+-------+--------------+-----------+--------------+
|
|
|9b969a05 |1f0fbf36 |in-use |attached |/dev/vdc | test |
|
|
+--------------+------------+-------+--------------+-----------+--------------+
|
|
1 row in set (0.00 sec)</programlisting>
|
|
|
|
<para>Next, manually detach and reattach the volumes, where X is the
|
|
proper mount point:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova volume-detach <instance_uuid> <volume_uuid>
|
|
# nova volume-attach <instance_uuid> <volume_uuid> /dev/vdX</programlisting>
|
|
|
|
<para>Be sure that the instance has successfully booted and is at a
|
|
login screen before doing the above.</para>
|
|
</section>
|
|
|
|
<section xml:id="totle_compute_node_failure">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Total Compute Node Failure</title>
|
|
|
|
<para>Compute nodes can fail the same way a cloud controller can fail. A
motherboard failure or some other type of hardware failure can cause an
entire compute node to go offline. When this happens, none of the
instances running on that compute node will be available. Just like with a
cloud controller failure, if your infrastructure monitoring does not
detect a failed compute node, your users will notify you because of
their lost instances.<indexterm class="singular">
|
|
<primary>compute nodes</primary>
|
|
|
|
<secondary>failures</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>compute node total failures</secondary>
|
|
</indexterm></para>
|
|
|
|
<para>If a compute node fails and won't be fixed for a few hours (or at
|
|
all), you can relaunch all instances that are hosted on the failed node
|
|
if you use shared storage for
|
|
<code>/var/lib/nova/instances</code>.</para>
|
|
|
|
<para>To do this, generate a list of instance UUIDs that are hosted on
|
|
the failed node by running the following query on the nova
|
|
database:</para>
|
|
|
|
<programlisting><?db-font-size 65%?>mysql> select uuid from instances where host = \
|
|
'c01.example.com' and deleted = 0;</programlisting>
|
|
|
|
<para>Next, update the nova database to indicate that all instances that
|
|
used to be hosted on c01.example.com are now hosted on
|
|
c02.example.com:</para>
|
|
|
|
<programlisting><?db-font-size 65%?>mysql> update instances set host = 'c02.example.com' where host = \
|
|
'c01.example.com' and deleted = 0;</programlisting>
|
|
|
|
<para>If you're using the Networking service ML2 plug-in, update the
|
|
Networking service database to indicate that all ports that
|
|
used to be hosted on c01.example.com are now hosted on
|
|
c02.example.com:</para>
|
|
|
|
<programlisting><?db-font-size 65%?>mysql> update ml2_port_bindings set host = 'c02.example.com' where host = \
|
|
'c01.example.com';</programlisting>
|
|
|
|
<programlisting><?db-font-size 65%?>mysql> update ml2_port_binding_levels set host = 'c02.example.com' where host = \
|
|
'c01.example.com';</programlisting>
|
|
|
|
<para>After that, use the <literal>nova</literal> command to reboot all
|
|
instances that were on c01.example.com while regenerating their XML
|
|
files at the same time:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova reboot --hard <uuid></programlisting>
|
|
|
|
<para>Finally, reattach volumes using the same method described in the
|
|
section <link linkend="volumes">Volumes</link>.</para>
|
|
</section>
|
|
|
|
<section xml:id="var_lib_nova_instances">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>/var/lib/nova/instances</title>
|
|
|
|
<para>It's worth mentioning this directory in the context of failed
|
|
compute nodes. This directory contains the libvirt KVM file-based disk
|
|
images for the instances that are hosted on that compute node. If you
|
|
are not running your cloud in a shared storage environment, this
|
|
directory is unique across all compute nodes.<indexterm class="singular">
|
|
<primary>/var/lib/nova/instances directory</primary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>/var/lib/nova/instances</secondary>
|
|
</indexterm></para>
|
|
|
|
<para><code>/var/lib/nova/instances</code> contains two types of
|
|
directories.</para>
|
|
|
|
<para>The first is the <code>_base</code> directory. This contains all
|
|
the cached base images from glance for each unique image that has been
|
|
launched on that compute node. Files ending in <code>_20</code> (or a
|
|
different number) are the ephemeral base images.</para>
|
|
|
|
<para>The other directories are titled <code>instance-xxxxxxxx</code>.
These directories correspond to instances running on that compute node.
The files inside are related to one of the files in the
<code>_base</code> directory. They're essentially differential files that
contain only the changes made relative to the original image in the
<code>_base</code> directory.</para>
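<para>For illustration only (the directory names below reuse the instance
names from the earlier examples, and the <code>_base</code> file names are
hypothetical), a compute node's layout looks similar to this:</para>

<programlisting><?db-font-size 65%?># ls /var/lib/nova/instances
_base  instance-00000981  instance-000009f5  instance-0000274a
# ls /var/lib/nova/instances/_base
a489c868f0c37da93b369b465e91e3fb83450d10
a489c868f0c37da93b369b465e91e3fb83450d10_20</programlisting>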
<para>All files and directories in <code>/var/lib/nova/instances</code>
|
|
are uniquely named. The files in _base are uniquely titled for the
|
|
glance image that they are based on, and the directory names
|
|
<code>instance-xxxxxxxx</code> are uniquely titled for that particular
|
|
instance. For example, if you copy all data from
|
|
<code>/var/lib/nova/instances</code> on one compute node to another, you
|
|
do not overwrite any files or cause any damage to images that have the
|
|
same unique name, because they are essentially the same file.</para>
|
|
|
|
<para>Although this method is not documented or supported, you can use
|
|
it when your compute node is permanently offline but you have instances
|
|
locally stored on it.</para>
|
|
</section>
|
|
</section>
|
|
|
|
<section xml:id="storage_node_failures">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Storage Node Failures and Maintenance</title>
|
|
|
|
<para>Because of the high redundancy of Object Storage, dealing with
|
|
object storage node issues is a lot easier than dealing with compute node
|
|
issues.</para>
|
|
|
|
<section xml:id="reboot_storage_node">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Rebooting a Storage Node</title>
|
|
|
|
<para>If a storage node requires a reboot, simply reboot it. Requests
|
|
for data hosted on that node are redirected to other copies while the
|
|
server is rebooting.<indexterm class="singular">
|
|
<primary>storage node</primary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>nodes</primary>
|
|
|
|
<secondary>storage nodes</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>storage node reboot</secondary>
|
|
</indexterm></para>
|
|
</section>
|
|
|
|
<section xml:id="shut_down_storage_node">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Shutting Down a Storage Node</title>
|
|
|
|
<para>If you need to shut down a storage node for an extended period of
|
|
time (one or more days), consider removing the node from the storage
|
|
ring. For example:<indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>storage node shut down</secondary>
|
|
</indexterm></para>
|
|
|
|
<programlisting><?db-font-size 65%?># swift-ring-builder account.builder remove <ip address of storage node>
|
|
# swift-ring-builder container.builder remove <ip address of storage node>
|
|
# swift-ring-builder object.builder remove <ip address of storage node>
|
|
# swift-ring-builder account.builder rebalance
|
|
# swift-ring-builder container.builder rebalance
|
|
# swift-ring-builder object.builder rebalance</programlisting>
|
|
|
|
<para>Next, redistribute the ring files to the other nodes:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># for i in s01.example.com s02.example.com s03.example.com
|
|
> do
|
|
> scp *.ring.gz $i:/etc/swift
|
|
> done</programlisting>
|
|
|
|
<para>These actions effectively take the storage node out of the storage
|
|
cluster.</para>
|
|
|
|
<para>When the node is able to rejoin the cluster, just add it back to
the ring. The exact syntax you use to add a node to your swift cluster
with <code>swift-ring-builder</code> depends heavily on the options used
when you originally created your cluster, so refer back to those commands;
an illustrative example follows.</para>
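<para>As an illustration only (the region, zone, IP address, port, device
name, and weight below are assumptions; substitute the values used when you
built your rings), re-adding a device and rebalancing might look like
this:</para>

<programlisting><?db-font-size 65%?># swift-ring-builder object.builder add r1z2-10.0.0.12:6000/sdb 100
# swift-ring-builder object.builder rebalance</programlisting>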
</section>
|
|
|
|
<section xml:id="replace_swift_disk">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Replacing a Swift Disk</title>
|
|
|
|
<para>If a hard drive fails in an Object Storage node, replacing it is
|
|
relatively easy. This assumes that your Object Storage environment is
|
|
configured correctly, where the data that is stored on the failed drive
|
|
is also replicated to other drives in the Object Storage
|
|
environment.<indexterm class="singular">
|
|
<primary>hard drives, replacing</primary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>swift disk replacement</secondary>
|
|
</indexterm></para>
|
|
|
|
<para>This example assumes that <code>/dev/sdb</code> has failed.</para>
|
|
|
|
<para>First, unmount the disk:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># umount /dev/sdb</programlisting>
|
|
|
|
<para>Next, physically remove the disk from the server and replace it
|
|
with a working disk.</para>
|
|
|
|
<para>Ensure that the operating system has recognized the new
|
|
disk:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># dmesg | tail</programlisting>
|
|
|
|
<para>You should see a message about <code>/dev/sdb</code>.</para>
|
|
|
|
<para>Because using partitions on a swift disk is not recommended, simply
format the whole disk:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># mkfs.xfs /dev/sdb</programlisting>
|
|
|
|
<para>Finally, mount the disk:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># mount -a</programlisting>
|
|
|
|
<para>Swift should notice the new disk and see that no data exists on it.
It then begins replicating data to the disk from the other existing
replicas.</para>
|
|
</section>
|
|
</section>
|
|
|
|
<section xml:id="complete_failure">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Handling a Complete Failure</title>
|
|
|
|
<para>A common way of recovering from a full system failure, such as a
data center power outage, is to assign each service a priority and restore
them in order. <xref
linkend="restor-prior-table" /> shows an example.<indexterm
|
|
class="singular">
|
|
<primary>service restoration</primary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>complete failures</secondary>
|
|
</indexterm></para>
|
|
|
|
<table rules="all" xml:id="restor-prior-table">
|
|
<caption>Example service restoration priority list</caption>
|
|
|
|
<thead>
|
|
<tr>
|
|
<th>Priority</th>
|
|
|
|
<th>Services</th>
|
|
</tr>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<tr>
|
|
<td><para>1</para></td>
|
|
|
|
<td><para>Internal network connectivity</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>2</para></td>
|
|
|
|
<td><para>Backing storage services</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>3</para></td>
|
|
|
|
<td><para>Public network connectivity for user virtual
|
|
machines</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>4</para></td>
|
|
|
|
<td><para><literal>nova-compute</literal>,
|
|
<literal>nova-network</literal>, cinder hosts</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>5</para></td>
|
|
|
|
<td><para>User virtual machines</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>10</para></td>
|
|
|
|
<td><para>Message queue and database services</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>15</para></td>
|
|
|
|
<td><para>Keystone services</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>20</para></td>
|
|
|
|
<td><para><literal>cinder-scheduler</literal></para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>21</para></td>
|
|
|
|
<td><para>Image Catalog and Delivery services</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>22</para></td>
|
|
|
|
<td><para><literal>nova-scheduler</literal> services</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>98</para></td>
|
|
|
|
<td><para><literal>cinder-api</literal></para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>99</para></td>
|
|
|
|
<td><para><literal>nova-api</literal> services</para></td>
|
|
</tr>
|
|
|
|
<tr>
|
|
<td><para>100</para></td>
|
|
|
|
<td><para>Dashboard node</para></td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<para>Use this example priority list to ensure that user-affected services
are restored as soon as possible, but not before a stable environment is
in place. Of course, despite being listed as single-line items, each step
requires significant work. For example, just after starting the database,
you should check its integrity, and after starting the nova services, you
should verify that the hypervisor matches the database and fix any <phrase
role="keep-together">mismatches</phrase> (see the sketch below).</para>
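<para>One minimal, illustrative way to compare the hypervisor's view with
the database's view on a given compute node (the hostname is an assumption)
is to list instances from both sides and investigate anything that appears
in only one of them:</para>

<programlisting><?db-font-size 65%?># virsh list --all
# nova list --host c01.example.com --all-tenants</programlisting>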
</section>
|
|
|
|
<section xml:id="config_mgmt">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Configuration Management</title>
|
|
|
|
<para>Maintaining an OpenStack cloud requires that you manage multiple
|
|
physical servers, and this number might grow over time. Because managing
|
|
nodes manually is error prone, we strongly recommend that you use a
|
|
configuration-management tool. These tools automate the process of
|
|
ensuring that all your nodes are configured properly and encourage you to
|
|
maintain your configuration information (such as packages and
|
|
configuration options) in a version-controlled repository.<indexterm
|
|
class="singular">
|
|
<primary>configuration management</primary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>networks</primary>
|
|
|
|
<secondary>configuration management</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>configuration management</secondary>
|
|
</indexterm></para>
|
|
|
|
<tip>
|
|
<para>Several configuration-management tools are available, and this
|
|
guide does not recommend a specific one. The two most popular ones in
|
|
the OpenStack community are <link
|
|
xlink:href="https://puppetlabs.com/">Puppet</link>, with available
|
|
<link xlink:href="https://github.com/puppetlabs/puppetlabs-openstack">OpenStack Puppet
|
|
modules</link>; and <link
|
|
xlink:href="http://www.getchef.com/chef/">Chef</link>, with available <link
|
|
xlink:href="https://github.com/opscode/openstack-chef-repo">OpenStack Chef recipes</link>.
|
|
Other newer configuration tools include <link
|
|
xlink:href="https://juju.ubuntu.com/">Juju</link>, <link
|
|
xlink:href="https://www.ansible.com/">Ansible</link>, and <link
|
|
xlink:href="http://www.saltstack.com/">Salt</link>; and more mature
|
|
configuration management tools include <link
|
|
xlink:href="http://cfengine.com/">CFEngine</link> and <link
|
|
xlink:href="http://bcfg2.org/">Bcfg2</link>.</para>
|
|
</tip>
|
|
</section>
|
|
|
|
<section xml:id="hardware">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Working with Hardware</title>
|
|
|
|
<para>As with your initial deployment, you should ensure that all hardware
is appropriately burned in before adding it to production. Run software
that pushes the hardware to its limits by maxing out RAM, CPU, disk, and
network. Many tools are available for this, and they normally double as
benchmarking software, so you also get a good idea of your system's
performance.<indexterm class="singular">
|
|
<primary>hardware</primary>
|
|
|
|
<secondary>maintenance/debugging</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>hardware</secondary>
|
|
</indexterm></para>
|
|
|
|
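<para>As one possible sketch (the <code>stress-ng</code> tool and the
parameters shown here are assumptions; use whichever burn-in tool you
prefer), the following exercises CPU, memory, and disk for 24 hours:</para>

<programlisting><?db-font-size 65%?># stress-ng --cpu 0 --vm 2 --vm-bytes 75% --hdd 2 --timeout 24h --metrics-brief</programlisting>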
<section xml:id="add_new_node">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Adding a Compute Node</title>
|
|
|
|
<para>If you find that you have reached or are reaching the capacity
|
|
limit of your computing resources, you should plan to add additional
|
|
compute nodes. Adding more nodes is quite easy. The process for adding
|
|
compute nodes is the same as when the initial compute nodes were
|
|
deployed to your cloud: use an automated deployment system to bootstrap
|
|
the bare-metal server with the operating system and then have a
|
|
configuration-management system install and configure OpenStack Compute.
|
|
Once the Compute service has been installed and configured in the same
|
|
way as the other compute nodes, it automatically attaches itself to the
|
|
cloud. The cloud controller notices the new node(s) and begins
|
|
scheduling instances to launch there.<indexterm class="singular">
|
|
<primary>cloud controllers</primary>
|
|
|
|
<secondary>new compute nodes and</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>nodes</primary>
|
|
|
|
<secondary>adding</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>compute nodes</primary>
|
|
|
|
<secondary>adding</secondary>
|
|
</indexterm></para>
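<para>To confirm that the controller has noticed the new node, a quick
check (illustrative; the exact output columns vary by release) is to list
the compute services and hypervisors and look for the new hostname:</para>

<programlisting><?db-font-size 65%?># nova service-list
# nova hypervisor-list</programlisting>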
|
|
|
|
<para>If your OpenStack Block Storage nodes are separate from your
|
|
compute nodes, the same procedure still applies because the same queuing
|
|
and polling system is used in both services.</para>
|
|
|
|
<para>We recommend that you use the same hardware for new compute and
|
|
block storage nodes. At the very least, ensure that the CPUs are similar
|
|
in the compute nodes to not break live migration.</para>
|
|
</section>
|
|
|
|
<section xml:id="add_new_object_node">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Adding an Object Storage Node</title>
|
|
|
|
<para>Adding a new object storage node is different from adding compute
|
|
or block storage nodes. You still want to initially configure the server
|
|
by using your automated deployment and configuration-management systems.
|
|
After that is done, you need to add the local disks of the object
|
|
storage node into the object storage ring. The exact command to do this
|
|
is the same command that was used to add the initial disks to the ring.
|
|
Simply rerun this command on the object storage proxy server for all
|
|
disks on the new object storage node. Once this has been done, rebalance
|
|
the ring and copy the resulting ring files to the other storage
|
|
nodes.<indexterm class="singular">
|
|
<primary>Object Storage</primary>
|
|
|
|
<secondary>adding nodes</secondary>
|
|
</indexterm></para>
|
|
|
|
<note>
|
|
<para>If your new object storage node has a different number of disks
|
|
than the original nodes have, the command to add the new node is
|
|
different from the original commands. These parameters vary from
|
|
environment to environment.</para>
|
|
</note>
|
|
</section>
|
|
|
|
<section xml:id="replace_components">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Replacing Components</title>
|
|
|
|
<para>Hardware failures are common in large-scale deployments such as
an infrastructure cloud. Consider your processes and balance time saving
|
|
against availability. For example, an Object Storage cluster can easily
|
|
live with dead disks in it for some period of time if it has sufficient
|
|
capacity. Or, if your compute installation is not full, you could
|
|
consider live migrating instances off a host with a RAM failure until
|
|
you have time to deal with the problem.</para>
|
|
</section>
|
|
</section>
|
|
|
|
<section xml:id="databases">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Databases</title>
|
|
|
|
<para>Almost all OpenStack components have an underlying database to store
persistent information. Usually this database is MySQL. Normal MySQL
administration applies to these databases; OpenStack does not configure
them in any out-of-the-ordinary way. Basic administration includes
performance tweaking, high availability, backup, recovery, and repair.
For more information, see a standard MySQL administration guide.<indexterm
|
|
class="singular">
|
|
<primary>databases</primary>
|
|
|
|
<secondary>maintenance/debugging</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>databases</secondary>
|
|
</indexterm></para>
|
|
|
|
<para>You can perform a couple of tricks with the database to either more
|
|
quickly retrieve information or fix a data inconsistency error—for
|
|
example, an instance was terminated, but the status was not updated in the
|
|
database. These tricks are discussed throughout this book.</para>
|
|
|
|
<section xml:id="database_connect">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Database Connectivity</title>
|
|
|
|
<para>Review the component's configuration file to see how each
|
|
OpenStack component accesses its corresponding database. Look for either
|
|
<code>sql_connection</code> or simply <code>connection</code>. The
|
|
following command uses <code>grep</code> to display the SQL connection
|
|
string for nova, glance, cinder, and keystone:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># <emphasis role="bold">grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf
|
|
/etc/cinder/cinder.conf /etc/keystone/keystone.conf</emphasis>
|
|
sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
|
|
sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
|
|
sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
|
|
sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
|
|
connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone</programlisting>
|
|
|
|
<para>The connection strings take this format:</para>
|
|
|
|
<programlisting><?db-font-size 65%?>mysql+pymysql:// <username> : <password> @ <hostname> / <database name></programlisting>
|
|
</section>
|
|
|
|
<section xml:id="perf_and_opt">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Performance and Optimizing</title>
|
|
|
|
<para>As your cloud grows, MySQL is used more and more heavily. If you
suspect that MySQL might be becoming a bottleneck, you should start
researching MySQL optimization. The MySQL manual has an entire section
|
|
dedicated to this topic: <link
|
|
xlink:href="http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html">Optimization
|
|
Overview</link>.</para>
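<para>One low-risk starting point, assuming MySQL and sufficient privileges
(these variables are standard, but verify them against your MySQL version),
is to enable the slow query log and see which queries take the
longest:</para>

<programlisting><?db-font-size 65%?>mysql> SET GLOBAL slow_query_log = 'ON';
mysql> SET GLOBAL long_query_time = 1;
mysql> SHOW GLOBAL STATUS LIKE 'Slow_queries';</programlisting>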
|
|
</section>
|
|
</section>
|
|
|
|
<section xml:id="hdmy">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>HDWMY</title>
|
|
|
|
<para>Here's a quick list of various to-do items for each hour, day, week,
month, and year. Please note that these tasks are neither required nor
definitive, but are simply helpful ideas:<indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>schedule of tasks</secondary>
|
|
</indexterm></para>
|
|
|
|
<section xml:id="hourly">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Hourly</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Check your monitoring system for alerts and act on
|
|
them.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Check your ticket queue for new tickets.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</section>
|
|
|
|
<section xml:id="daily">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Daily</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Check for instances in a failed or weird state and investigate
|
|
why.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Check for security patches and apply them as needed.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</section>
|
|
|
|
<section xml:id="weekly">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Weekly</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Check cloud usage: <itemizedlist>
|
|
<listitem>
|
|
<para>User quotas</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Disk space</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Image usage</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Large instances</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Network usage (bandwidth and IP usage)</para>
|
|
</listitem>
|
|
</itemizedlist></para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Verify your alert mechanisms are still working.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</section>
|
|
|
|
<section xml:id="monthly">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Monthly</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Check usage and trends over the past month.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Check for user accounts that should be removed.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Check for operator accounts that should be removed.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</section>
|
|
|
|
<section xml:id="quarterly">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Quarterly</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Review usage and trends over the past quarter.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Prepare any quarterly reports on usage and statistics.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Review and plan any necessary cloud additions.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Review and plan any major OpenStack upgrades.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</section>
|
|
|
|
<section xml:id="semiannual">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Semiannually</title>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Upgrade OpenStack.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Clean up after an OpenStack upgrade (any unused or new
|
|
services to be aware of?).</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</section>
|
|
</section>
|
|
|
|
<section xml:id="broken_component">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Determining Which Component Is Broken</title>
|
|
|
|
<para>OpenStack's components interact with each other strongly. For
example, uploading an image requires interaction between
<code>nova-api</code>, <code>glance-api</code>,
<code>glance-registry</code>, keystone, and potentially
<code>swift-proxy</code>. As a result, it is sometimes difficult to
determine exactly where problems lie. Assisting with this is the purpose of
this section.<indexterm class="singular">
|
|
<primary>logging/monitoring</primary>
|
|
|
|
<secondary>tailing logs</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>determining component affected</secondary>
|
|
</indexterm></para>
|
|
|
|
<section xml:id="tailing_logs">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Tailing Logs</title>
|
|
|
|
<para>The first place to look is the log file related to the command you
|
|
are trying to run. For example, if <code>nova list</code> is failing,
|
|
try tailing a nova log file and running the command again:<indexterm
|
|
class="singular">
|
|
<primary>tailing logs</primary>
|
|
</indexterm></para>
|
|
|
|
<para>Terminal 1:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># tail -f /var/log/nova/nova-api.log</programlisting>
|
|
|
|
<para>Terminal 2:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova list</programlisting>
|
|
|
|
<para>Look for any errors or traces in the log file. For more
|
|
information, see <xref linkend="logging_monitoring" />.</para>
|
|
|
|
<para>If the error indicates that the problem is with another component,
|
|
switch to tailing that component's log file. For example, if nova cannot
|
|
access glance, look at the <literal>glance-api</literal> log:</para>
|
|
|
|
<para>Terminal 1:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># tail -f /var/log/glance/api.log</programlisting>
|
|
|
|
<para>Terminal 2:</para>
|
|
|
|
<programlisting><?db-font-size 65%?># nova list</programlisting>
|
|
|
|
<para>Wash, rinse, and repeat until you find the core cause of the
|
|
problem.</para>
|
|
</section>
|
|
|
|
<section xml:id="daemons_cli">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Running Daemons on the CLI</title>
|
|
|
|
<para>Unfortunately, sometimes the error is not apparent from the log
|
|
files. In this case, switch tactics and use a different command; maybe
|
|
run the service directly on the command line. For example, if the
|
|
<code>glance-api</code> service refuses to start and stay running, try
|
|
launching the daemon from the command line:<indexterm class="singular">
|
|
<primary>daemons</primary>
|
|
|
|
<secondary>running on CLI</secondary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>Command-line interface (CLI)</primary>
|
|
</indexterm></para>
|
|
|
|
<programlisting><?db-font-size 65%?># sudo -u glance -H glance-api</programlisting>
|
|
|
|
<para>This might print the error and cause of the problem.<note>
|
|
<para>The <literal>-H</literal> flag is required when running the
|
|
daemons with sudo because some daemons will write files relative to
|
|
the user's home directory, and this write may fail if
|
|
<literal>-H</literal> is left off.</para>
|
|
</note></para>
|
|
|
|
<sidebar>
|
|
<title>Example of Complexity</title>
|
|
|
|
<para>One morning, a compute node failed to run any instances. The log
|
|
files were a bit vague, claiming that a certain instance was unable to
|
|
be started. This ended up being a red herring because the instance was
|
|
simply the first instance in alphabetical order, so it was the first
|
|
instance that <literal>nova-compute</literal> would touch.</para>
|
|
|
|
<para>Further troubleshooting showed that libvirt was not running at
|
|
all. This made more sense. If libvirt wasn't running, then no instance
|
|
could be virtualized through KVM. Upon trying to start libvirt, it
|
|
would silently die immediately. The libvirt logs did not explain
|
|
why.</para>
|
|
|
|
<para>Next, the <code>libvirtd</code> daemon was run on the command
|
|
line. Finally a helpful error message: it could not connect to d-bus.
|
|
As ridiculous as it sounds, libvirt, and thus
|
|
<code>nova-compute</code>, relies on d-bus and somehow d-bus crashed.
|
|
Simply starting d-bus set the entire chain back on track, and soon
|
|
everything was back up and running.</para>
|
|
</sidebar>
|
|
</section>
|
|
</section>
|
|
|
|
<?hard-pagebreak ?>
|
|
|
|
<section xml:id="runningslow">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>What to Do When Things Are Running Slowly</title>
|
|
|
|
<para>
|
|
When you are getting slow responses from various services, it can be
|
|
hard to know where to start looking. The first thing to check is the
|
|
extent of the slowness: is it specific to a single service, or varied
|
|
among different services? If your problem is isolated to a specific
|
|
service, it can temporarily be fixed by restarting the service, but that
|
|
is often only a fix for the symptom and not the actual problem.
|
|
</para>
|
|
|
|
<para>
|
|
This is a collection of ideas from experienced operators on common
|
|
things to look at that may be the cause of slowness. It is not, however,
|
|
designed to be an exhaustive list.
|
|
</para>
|
|
|
|
<section xml:id="runningslow_keystone">
|
|
<?dbhtml stop-chunking?>
|
|
<title>OpenStack Identity service</title>
|
|
<para>
|
|
If OpenStack Identity is responding slowly, it could be due to the
|
|
token table getting large. This can be fixed by running the
|
|
<command>keystone-manage token_flush</command> command.
|
|
</para>
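<para>Because expired tokens accumulate continuously, many operators run
the flush periodically. A minimal sketch (the schedule and the user are
assumptions; adjust them to your environment) is a cron entry such
as:</para>

<programlisting><?db-font-size 65%?># crontab -l -u keystone
@hourly keystone-manage token_flush > /dev/null</programlisting>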
|
|
<para>
|
|
Additionally, for Identity-related issues, try the tips in
|
|
<xref linkend="runningslow_sql" />.
|
|
</para>
|
|
</section>
|
|
|
|
<section xml:id="runningslow_glance">
|
|
<?dbhtml stop-chunking?>
|
|
<title>OpenStack Image service</title>
|
|
<para>
The OpenStack Image service can be slowed down by Identity-related
issues, but it can also slow down on its own if connectivity to its
back-end storage is slow or otherwise problematic. For example, your
back-end NFS server might have gone down.
</para>
|
|
</section>
|
|
|
|
<section xml:id="runningslow_cinder">
|
|
<?dbhtml stop-chunking?>
|
|
<title>OpenStack Block Storage service</title>
|
|
<para>
The OpenStack Block Storage service is similar to the Image service, so
start by checking Identity-related services and the back-end storage.
Additionally, both the Block Storage and Image services rely on AMQP
and SQL functionality, so consider these when debugging.
</para>
|
|
</section>
|
|
|
|
<section xml:id="runningslow_nova">
|
|
<?dbhtml stop-chunking?>
|
|
<title>OpenStack Compute service</title>
|
|
<para>
Services related to OpenStack Compute are normally fairly fast and
rely on a couple of back-end services: Identity for authentication and
authorization, and AMQP for interoperability. Any slowness in Compute is
normally related to one of these. Also, as with all other services, SQL
is used extensively.
</para>
|
|
</section>
|
|
|
|
<section xml:id="runningslow_neutron">
|
|
<?dbhtml stop-chunking?>
|
|
<title>OpenStack Networking service</title>
|
|
<para>
|
|
Slowness in the OpenStack Networking service can be caused by services
|
|
that it relies upon, but it can also be related to either physical or
|
|
virtual networking. For example: network namespaces that do not exist
|
|
or are not tied to interfaces correctly; DHCP daemons that have hung
|
|
or are not running; a cable being physically disconnected; a switch
|
|
not being configured correctly. When debugging Networking service
|
|
problems, begin by verifying all physical networking functionality
|
|
(switch configuration, physical cabling, etc.). After the physical
|
|
networking is verified, check to be sure all of the Networking
|
|
services are running (neutron-server, neutron-dhcp-agent, etc.), then
|
|
check on AMQP and SQL back ends.
|
|
</para>
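<para>A minimal first pass on a network node (the commands below are
illustrative; adjust the client and agent names to your deployment) might
be:</para>

<programlisting><?db-font-size 65%?># ip netns list
# neutron agent-list
# ps aux | grep neutron</programlisting>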
|
|
</section>
|
|
|
|
<section xml:id="runningslow_amqp">
|
|
<?dbhtml stop-chunking?>
|
|
<title>AMQP broker</title>
|
|
<para>
Regardless of which AMQP broker you use, such as RabbitMQ, there are
common issues that not only slow down operations but can also cause
real problems. Sometimes messages queued for services stay on the
queues and are not consumed. This can be due to dead or stagnant
services and can commonly be cleared up by restarting either the
AMQP-related services or the OpenStack service in question.
</para>
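<para>With RabbitMQ, for example, one way to spot queues that are
accumulating unconsumed messages (this assumes the
<code>rabbitmqctl</code> tool is available on the broker host) is:</para>

<programlisting><?db-font-size 65%?># rabbitmqctl list_queues name messages consumers | sort -n -k2 | tail</programlisting>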
|
|
</section>
|
|
|
|
<section xml:id="runningslow_sql">
|
|
<?dbhtml stop-chunking?>
|
|
<title>SQL back end</title>
|
|
<para>
Whether you use SQLite or an RDBMS (such as MySQL), SQL
interoperability is essential to a functioning OpenStack environment.
A large or fragmented SQLite file can cause slowness when using files
as a back end. A locked or long-running query can cause delays for
most RDBMS services. In this case, do not kill the query immediately;
look into whether it is a query that has hung or one that is simply
long running and needs to finish on its own. The administration of an
RDBMS is outside the scope of this document, but a properly functioning
RDBMS is essential to most OpenStack services.
</para>
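<para>With MySQL, for example, a quick and illustrative way to see whether
a long-running or locked query is the culprit is:</para>

<programlisting><?db-font-size 65%?>mysql> SHOW FULL PROCESSLIST;</programlisting>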
|
|
</section>
|
|
|
|
</section>
|
|
|
|
<?hard-pagebreak ?>
|
|
|
|
<section xml:id="uninstalling">
|
|
<?dbhtml stop-chunking?>
|
|
|
|
<title>Uninstalling</title>
|
|
|
|
<para>While we'd always recommend using your automated deployment system
|
|
to reinstall systems from scratch, sometimes you do need to remove
|
|
OpenStack from a system the hard way. Here's how:<indexterm
|
|
class="singular">
|
|
<primary>uninstall operation</primary>
|
|
</indexterm><indexterm class="singular">
|
|
<primary>maintenance/debugging</primary>
|
|
|
|
<secondary>uninstalling</secondary>
|
|
</indexterm></para>
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>Remove all packages.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Remove remaining files.</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>Remove databases.</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
<para>These steps depend on your underlying distribution, but in general
|
|
you should be looking for "purge" commands in your package manager, like
|
|
<literal>aptitude purge ~c $package</literal>. Following this, you can
|
|
look for orphaned files in the directories referenced throughout this
|
|
guide. To uninstall the database properly, refer to the manual appropriate
|
|
for the product in use.<indexterm class="endofrange"
|
|
startref="maindebug" /></para>
|
|
</section>
|
|
</chapter>
|