
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter [
<!-- Some useful entities borrowed from HTML -->
<!ENTITY ndash "&#x2013;">
<!ENTITY mdash "&#x2014;">
<!ENTITY hellip "&#x2026;">
<!ENTITY plusmn "&#xB1;">
]>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="maintenance">
<?dbhtml stop-chunking?>
<title>Maintenance, Failures, and Debugging</title>
<para>Downtime, whether planned or unscheduled, is a certainty
when running a cloud. This chapter aims to provide useful
information for dealing proactively or reactively with these
occurrences.</para>
<section xml:id="cloud_controller_storage">
<?dbhtml stop-chunking?>
<title>Cloud Controller and Storage Proxy Failures and
Maintenance</title>
<para>The cloud controller and storage proxy are very similar
to each other when it comes to expected and unexpected
downtime. Typically, only one server of each type runs in
the cloud, which makes an outage very noticeable.</para>
<para>For the cloud controller, the good news is that if your
cloud uses the FlatDHCP multi-host HA network mode, existing
instances and volumes continue to operate while the cloud
controller is offline. For the storage proxy, however, no
storage traffic is possible until it is back up and
running.</para>
<section xml:id="planned_maintenance">
<?dbhtml stop-chunking?>
<title>Planned Maintenance</title>
<para>One way to plan for cloud controller or storage
proxy maintenance is simply to do it off-hours, such
as at 1 or 2 a.m. This strategy impacts fewer users.
If your cloud controller or storage proxy is too
important to be unavailable at any point in time,
you should look into high-availability options.</para>
</section>
<section xml:id="reboot_cloud_controller">
<?dbhtml stop-chunking?>
<title>Rebooting a Cloud Controller or Storage
Proxy</title>
<para>All in all, simply issue the <code>reboot</code>
command. The operating system cleanly shuts services down
and then automatically reboots. If you want to be very
thorough, run your backup jobs just before you
reboot.</para>
</section>
<section xml:id="after_a_cc_reboot">
<?dbhtml stop-chunking?>
<title>After a Cloud Controller or Storage Proxy
Reboots</title>
<para>After a cloud controller reboots, ensure that all
required services were successfully started. The
following commands use <code>ps</code> and
<code>grep</code> to determine if nova, glance,
keystone, and cinder are currently running:</para>
<programlisting><?db-font-size 65%?># ps aux | grep nova-
# ps aux | grep glance-
# ps aux | grep keystone
# ps aux | grep cinder</programlisting>
<para>Also check that all services are functioning. The
following set of commands sources the
<code>openrc</code> file, then runs some basic
glance, nova, and keystone commands. If the commands
work as expected, you can be confident that those
services are in working condition:</para>
<programlisting><?db-font-size 65%?># source openrc
# glance index
# nova list
# keystone tenant-list</programlisting>
<para>For the storage proxy, ensure that the Object
Storage service has resumed:</para>
<programlisting><?db-font-size 65%?># ps aux | grep swift</programlisting>
<para>Also check that it is functioning:</para>
<programlisting><?db-font-size 65%?># swift stat</programlisting>
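The process checks above can be rolled into one small script. The sketch below is a hypothetical helper, not part of the guide's tooling; the service list is an assumption, so adjust it to match what actually runs on your controller:

```shell
#!/bin/sh
# Hypothetical sketch: report any expected OpenStack services that are
# missing from a process listing after a reboot. The service names given
# in the example invocation are assumptions.
check_services() {
    # $1 is the output of `ps aux`; remaining args are service patterns.
    ps_output=$1; shift
    missing=""
    for svc in "$@"; do
        echo "$ps_output" | grep -q "$svc" || missing="$missing $svc"
    done
    echo "$missing" | sed 's/^ //'
}

# Example invocation on a live controller (uncomment to use):
# check_services "$(ps aux)" nova-api glance-api keystone cinder-api
```

An empty result means every listed service was found; anything printed is a service to investigate and restart.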
</section>
<section xml:id="cc_failure">
<?dbhtml stop-chunking?>
<title>Total Cloud Controller Failure</title>
<para>The cloud controller could completely fail if, for
example, its motherboard goes bad. Users will
immediately notice the loss of a cloud controller
since it provides core functionality to your cloud
environment. If your infrastructure monitoring does
not alert you that your cloud controller has failed,
your users definitely will. Unfortunately, this is a
rough situation. The cloud controller is an integral
part of your cloud. If you have only one controller,
you will have many missing services if it goes
down.</para>
<para>To avoid this situation, create a highly available
cloud controller cluster. This is outside the scope of
this document, but you can read more in the draft
<link
xlink:title="OpenStack High Availability Guide"
xlink:href="http://docs.openstack.org/trunk/openstack-ha/content/ch-intro.html"
>OpenStack High Availability Guide</link>
(http://docs.openstack.org/trunk/openstack-ha/content/ch-intro.html).</para>
<para>The next best approach is to use a configuration
management tool, such as Puppet, to automatically build
a cloud controller. This should not take more than 15
minutes if you have a spare server available. After
the controller is rebuilt, restore any backups taken
(see the <emphasis role="bold">Backup and
Recovery</emphasis> chapter).</para>
<para>Also, in practice, the <code>nova-compute</code>
services on the compute nodes sometimes do not reconnect
cleanly to RabbitMQ hosted on the controller when it comes
back up after a long reboot; in that case, you must restart
the nova services on the compute nodes.</para>
</section>
</section>
<section xml:id="compute_node_failures">
<?dbhtml stop-chunking?>
<title>Compute Node Failures and Maintenance</title>
<para>Sometimes a compute node either crashes unexpectedly or
requires a reboot for maintenance reasons.</para>
<section xml:id="planned_maintenance_compute_node">
<?dbhtml stop-chunking?>
<title>Planned Maintenance</title>
<para>If you need to reboot a compute node due to planned
maintenance (such as a software or hardware upgrade),
first ensure that all hosted instances have been moved
off of the node. If your cloud is utilizing shared
storage, use the <code>nova live-migration</code>
command. First, get a list of instances that need to
be moved:</para>
<programlisting><?db-font-size 65%?># nova list --host c01.example.com --all-tenants</programlisting>
<para>Next, migrate them one by one:</para>
<programlisting><?db-font-size 65%?># nova live-migration &lt;uuid&gt; c02.example.com</programlisting>
<para>If you are not using shared storage, you can use the
<code>--block-migrate</code> option:</para>
<programlisting><?db-font-size 65%?># nova live-migration --block-migrate &lt;uuid&gt; c02.example.com</programlisting>
<para>After you have migrated all instances, ensure the
<code>nova-compute</code> service has
stopped:</para>
<programlisting><?db-font-size 65%?># stop nova-compute</programlisting>
<para>If you use a configuration management system, such
as Puppet, that ensures the <code>nova-compute</code>
service is always running, you can temporarily move
the init files:</para>
<programlisting><?db-font-size 65%?># mkdir /root/tmp
# mv /etc/init/nova-compute.conf /root/tmp
# mv /etc/init.d/nova-compute /root/tmp</programlisting>
<para>Next, shut your compute node down, perform your
maintenance, and turn the node back on. You can
re-enable the <code>nova-compute</code> service by
undoing the previous commands:</para>
<programlisting><?db-font-size 65%?># mv /root/tmp/nova-compute.conf /etc/init
# mv /root/tmp/nova-compute /etc/init.d/</programlisting>
<para>Then start the <code>nova-compute</code>
service:</para>
<programlisting><?db-font-size 65%?># start nova-compute</programlisting>
<para>You can now optionally migrate the instances back to
their original compute node.</para>
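The listing and migration steps above can be combined into a single sketch. This is a hypothetical helper, not an official tool; it assumes the table layout that <code>nova list</code> prints, with the instance UUID in the second |-delimited column:

```shell
#!/bin/sh
# Hypothetical sketch: evacuate every instance from one compute node via
# live migration, assuming shared storage (add --block-migrate otherwise).
extract_uuids() {
    # Pull the ID column out of a `nova list` table, skipping the header.
    awk -F'|' 'NF > 2 && $2 !~ /ID/ { gsub(/ /, "", $2); if ($2 != "") print $2 }'
}

evacuate() {
    src=$1 dst=$2
    nova list --host "$src" --all-tenants | extract_uuids |
    while read -r uuid; do
        echo "Migrating $uuid from $src to $dst"
        nova live-migration "$uuid" "$dst"
    done
}

# Example (run with admin credentials sourced):
# evacuate c01.example.com c02.example.com
```

Migrating one instance at a time, as the loop does, keeps load on the source and destination hosts predictable.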
</section>
<section xml:id="after_compute_node_reboot">
<?dbhtml stop-chunking?>
<title>After a Compute Node Reboots</title>
<para>When you reboot a compute node, first verify that it
booted successfully. This includes ensuring the
<code>nova-compute</code> service is
running:</para>
<programlisting><?db-font-size 65%?># ps aux | grep nova-compute
# status nova-compute</programlisting>
<para>Also ensure that it has successfully connected to
the AMQP server:</para>
<programlisting><?db-font-size 65%?># grep AMQP /var/log/nova/nova-compute
2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672</programlisting>
<para>After the compute node is successfully running, you
must deal with the instances that are hosted on that
compute node, because none of them are running. Depending
on your SLA with your users or customers, you might have
to start each instance and ensure that it starts
correctly.</para>
</section>
<section xml:id="maintenance_instances">
<?dbhtml stop-chunking?>
<title>Instances</title>
<para>You can create a list of instances that are hosted
on the compute node by performing the following
command:</para>
<programlisting><?db-font-size 65%?># nova list --host c01.example.com --all-tenants</programlisting>
<para>After you have the list, you can use the nova
command to start each instance:</para>
<programlisting><?db-font-size 65%?># nova reboot &lt;uuid&gt;</programlisting>
<note>
<para>Any time an instance shuts down unexpectedly,
it might have problems on boot. For example, the
instance might require an <code>fsck</code> on the
root partition. If this happens, the user can use
the Dashboard VNC console to fix this.</para>
</note>
<para>If an instance does not boot, meaning <code>virsh
list</code> never shows the instance as even
attempting to boot, do the following on the compute
node:</para>
<programlisting><?db-font-size 65%?># tail -f /var/log/nova/nova-compute.log</programlisting>
<para>Try executing the <code>nova reboot</code> command
again. You should see an error message about why the
instance was not able to boot.</para>
<para>In most cases, the error is due to something in
libvirt's XML file
(<code>/etc/libvirt/qemu/instance-xxxxxxxx.xml</code>)
that no longer exists. You can enforce recreation of
the XML file as well as rebooting the instance by
running:</para>
<programlisting><?db-font-size 65%?># nova reboot --hard &lt;uuid&gt;</programlisting>
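Since the usual culprit is a disk source path in the libvirt XML that no longer exists, a quick check can save log reading. The sketch below is hypothetical and the helper names are invented; it only assumes the <code>&lt;source file='...'/&gt;</code> element form used in libvirt domain XML:

```shell
#!/bin/sh
# Hypothetical sketch: list disk source paths referenced by a libvirt
# domain XML file and flag any that no longer exist on disk.
source_files() {
    # Extract the file= attribute from <source file='...'/> elements.
    sed -n "s/.*<source file='\([^']*\)'.*/\1/p" "$1"
}

check_sources() {
    source_files "$1" | while read -r f; do
        [ -e "$f" ] || echo "missing: $f"
    done
}

# Example, following the naming convention mentioned above:
# check_sources /etc/libvirt/qemu/instance-0000274a.xml
```

Any "missing:" line points at the stale reference that a <code>nova reboot --hard</code> should regenerate.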
</section>
<section xml:id="inspect_and_recover_failed_instances">
<?dbhtml stop-chunking?>
<title>Inspecting and Recovering Data from Failed Instances</title>
<para>In some scenarios, instances are running but are inaccessible
through SSH and do not respond to any command. The VNC console
could be displaying a boot failure or kernel panic error message.
This could be an indication of file system corruption on the
VM itself. If you need to recover files or inspect the content
of the instance, qemu-nbd can be used to mount the disk.</para>
<warning>
<para>If you access or view the user's content and data, get
their approval first!</para>
</warning>
<para>To access the instance's disk
(<code>/var/lib/nova/instances/instance-xxxxxx/disk</code>),
follow these steps:</para>
<orderedlist>
<listitem>
<para>Suspend the instance using the virsh command</para>
</listitem>
<listitem>
<para>Connect the qemu-nbd device to the disk</para>
</listitem>
<listitem>
<para>Mount the qemu-nbd device</para>
</listitem>
<listitem>
<para>Unmount the device after inspecting</para>
</listitem>
<listitem>
<para>Disconnect the qemu-nbd device</para>
</listitem>
<listitem>
<para>Resume the instance</para>
</listitem>
</orderedlist>
<para>If you do not follow steps 4 through 6, OpenStack Compute
cannot manage the instance any longer. It fails to respond to
any command issued by OpenStack Compute, and it is marked as
shut down.</para>
<para>Once you mount the disk file, you should be able to
access it and treat it as a normal directory tree with files
and a directory structure. However, we do not recommend that
you edit or touch any files, because this could change the
Access Control Lists (ACLs) that are used to determine which
accounts can perform what operations on files and directories.
Changing ACLs can make the instance unbootable if it is not
already.</para>
<orderedlist>
<listitem>
<para>Suspend the instance using the virsh command - taking
note of the internal ID:</para>
<programlisting><?db-font-size 65%?># virsh list
Id Name State
----------------------------------
1 instance-00000981 running
2 instance-000009f5 running
30 instance-0000274a running
# virsh suspend 30
Domain 30 suspended</programlisting>
</listitem>
<listitem>
<para>Connect the qemu-nbd device to the disk:</para>
<programlisting><?db-font-size 65%?># cd /var/lib/nova/instances/instance-0000274a
# ls -lh
total 33M
-rw-rw---- 1 libvirt-qemu kvm 6.3K Oct 15 11:31 console.log
-rw-r--r-- 1 libvirt-qemu kvm 33M Oct 15 22:06 disk
-rw-r--r-- 1 libvirt-qemu kvm 384K Oct 15 22:06 disk.local
-rw-rw-r-- 1 nova nova 1.7K Oct 15 11:30 libvirt.xml
# qemu-nbd -c /dev/nbd0 `pwd`/disk</programlisting>
</listitem>
<listitem>
<para>Mount the qemu-nbd device.</para>
<para>The qemu-nbd device tries to export the instance
disk's different partitions as separate devices. For
example, if vda is the disk and vda1 is the root
partition, qemu-nbd exports them as /dev/nbd0 and
/dev/nbd0p1, respectively:</para>
<programlisting><?db-font-size 65%?># mount /dev/nbd0p1 /mnt/</programlisting>
<para>You can now access the contents of
<code>/mnt</code> which correspond to the
first partition of the instance's disk.</para>
<para>To examine the secondary or ephemeral disk, use an
alternate mount point if you want both primary and
secondary drives mounted at the same time:</para>
<programlisting><?db-font-size 65%?># umount /mnt
# qemu-nbd -c /dev/nbd1 `pwd`/disk.local
# mount /dev/nbd1 /mnt/</programlisting>
<programlisting><?db-font-size 65%?># ls -lh /mnt/
total 76K
lrwxrwxrwx. 1 root root 7 Oct 15 00:44 bin -&gt; usr/bin
dr-xr-xr-x. 4 root root 4.0K Oct 15 01:07 boot
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 dev
drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
drwxr-xr-x. 3 root root 4.0K Oct 15 01:07 home
lrwxrwxrwx. 1 root root 7 Oct 15 00:44 lib -&gt; usr/lib
lrwxrwxrwx. 1 root root 9 Oct 15 00:44 lib64 -&gt; usr/lib64
drwx------. 2 root root 16K Oct 15 00:42 lost+found
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 media
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 mnt
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 opt
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 proc
dr-xr-x---. 3 root root 4.0K Oct 15 21:56 root
drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
lrwxrwxrwx. 1 root root 8 Oct 15 00:44 sbin -&gt; usr/sbin
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 srv
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 sys
drwxrwxrwt. 9 root root 4.0K Oct 15 16:29 tmp
drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var</programlisting>
</listitem>
<listitem>
<para>Once you have completed the inspection, unmount the
mount point and release the qemu-nbd device:</para>
<programlisting><?db-font-size 65%?># umount /mnt
# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected</programlisting>
</listitem>
<listitem>
<para>Resume the instance using virsh:</para>
<programlisting><?db-font-size 65%?># virsh list
Id Name State
----------------------------------
1 instance-00000981 running
2 instance-000009f5 running
30 instance-0000274a paused
# virsh resume 30
Domain 30 resumed</programlisting>
</listitem>
</orderedlist>
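The six steps above can be sketched as one guarded sequence. This is a hypothetical wrapper, not a supported tool; only the partition-naming helper is pure logic, and everything else requires virsh and qemu-nbd on a real compute node:

```shell
#!/bin/sh
# Hypothetical sketch of the inspection workflow described above.
nbd_part() {
    # qemu-nbd exposes partition N of /dev/nbdX as /dev/nbdXpN.
    echo "${1}p${2}"
}

inspect_disk() {
    dom=$1 disk=$2
    virsh suspend "$dom"                  # 1. pause the guest
    qemu-nbd -c /dev/nbd0 "$disk"         # 2. attach the disk image
    mount "$(nbd_part /dev/nbd0 1)" /mnt  # 3. mount the root partition
    ls -lh /mnt                           #    ...inspect or copy data here...
    umount /mnt                           # 4. unmount
    qemu-nbd -d /dev/nbd0                 # 5. disconnect the device
    virsh resume "$dom"                   # 6. resume the guest
}

# Example:
# inspect_disk 30 /var/lib/nova/instances/instance-0000274a/disk
```

Keeping steps 4 through 6 in the same function as steps 1 through 3 makes it harder to leave the instance in the unmanageable state described above.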
</section>
<section xml:id="volumes">
<?dbhtml stop-chunking?>
<title>Volumes</title>
<para>If the affected instances also had attached volumes,
first generate a list of instance and volume
UUIDs:</para>
<programlisting><?db-font-size 65%?>mysql&gt; select nova.instances.uuid as instance_uuid, cinder.volumes.id as volume_uuid, cinder.volumes.status,
cinder.volumes.attach_status, cinder.volumes.mountpoint, cinder.volumes.display_name from cinder.volumes
inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
where nova.instances.host = 'c01.example.com';</programlisting>
<para>You should see a result like the following:</para>
<programlisting><?db-font-size 55%?>
+--------------+------------+-------+--------------+-----------+--------------+
|instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
+--------------+------------+-------+--------------+-----------+--------------+
|9b969a05 |1f0fbf36 |in-use |attached |/dev/vdc | test |
+--------------+------------+-------+--------------+-----------+--------------+
1 row in set (0.00 sec)</programlisting>
<para>Next, manually detach and reattach the
volumes:</para>
<programlisting><?db-font-size 65%?># nova volume-detach &lt;instance_uuid&gt; &lt;volume_uuid&gt;
# nova volume-attach &lt;instance_uuid&gt; &lt;volume_uuid&gt; /dev/vdX</programlisting>
<para>Where X is the proper device letter. Make sure that
the instance has successfully booted and is at a login
screen before running the above commands.</para>
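To avoid typing the detach and attach pair for every volume, you can generate the commands from the query results and review them before running anything. A hypothetical sketch, assuming "instance_uuid volume_uuid mountpoint" triples on standard input (for example, copied out of the MySQL result above):

```shell
#!/bin/sh
# Hypothetical sketch: emit the nova commands to reattach a set of
# volumes, so the plan can be reviewed before being piped to `sh`.
plan_reattach() {
    while read -r instance volume mountpoint; do
        [ -n "$volume" ] || continue
        echo "nova volume-detach $instance $volume"
        echo "nova volume-attach $instance $volume $mountpoint"
    done
}

# Review first, then execute:
# plan_reattach < volumes.txt
# plan_reattach < volumes.txt | sh
```

Emitting the commands as text first is a deliberate safety step, since a mistyped device letter here affects a user's data.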
</section>
<section xml:id="totle_compute_node_failure">
<?dbhtml stop-chunking?>
<title>Total Compute Node Failure</title>
<para>Compute nodes can fail the same way a cloud
controller can fail. A motherboard failure or some
other type of hardware failure can cause an entire
compute node to go offline. When this happens, all
instances running on that compute node will not be
available. Just like with a cloud controller failure,
if your infrastructure monitoring does not detect a
failed compute node, your users will notify you due to
their lost instances.</para>
<para>If a compute node fails and won't be
fixed for a few hours (or ever), you can
relaunch all instances that are hosted on the failed
node if you use shared storage for
<code>/var/lib/nova/instances</code>.</para>
<para>To do this, generate a list of instance UUIDs that
are hosted on the failed node by running the following
query on the nova database:</para>
<programlisting><?db-font-size 65%?>mysql&gt; select uuid from instances where host = 'c01.example.com' and deleted = 0;</programlisting>
<para>Next, tell Nova that all instances that used to be
hosted on c01.example.com are now hosted on
c02.example.com:</para>
<programlisting><?db-font-size 65%?>mysql&gt; update instances set host = 'c02.example.com' where host = 'c01.example.com' and deleted = 0;</programlisting>
<para>After that, use the nova command to reboot all
instances that were on c01.example.com while
regenerating their XML files at the same time:</para>
<programlisting><?db-font-size 65%?># nova reboot --hard &lt;uuid&gt;</programlisting>
<para>Finally, re-attach volumes using the same method
described in <emphasis role="bold">Volumes</emphasis>.</para>
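The database update can be generated rather than typed, which makes it easier to review before it touches the nova database. A hypothetical sketch; the function name is invented:

```shell
#!/bin/sh
# Hypothetical sketch: build the SQL that moves all non-deleted
# instances from a dead host to a live one, matching the manual
# UPDATE statement shown above.
failover_sql() {
    src=$1 dst=$2
    echo "UPDATE instances SET host = '$dst' WHERE host = '$src' AND deleted = 0;"
}

# Review, then apply and hard-reboot the affected instances:
# failover_sql c01.example.com c02.example.com
# failover_sql c01.example.com c02.example.com | mysql nova
```

After applying the update, reboot each instance with <code>nova reboot --hard</code> as described above so its XML file is regenerated on the new host.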
</section>
<section xml:id="var_lib_nova_instances">
<?dbhtml stop-chunking?>
<title>/var/lib/nova/instances</title>
<para>It's worth mentioning this directory in the context
of failed compute nodes. This directory contains the
libvirt KVM file-based disk images for the instances
that are hosted on that compute node. If you are not
running your cloud in a shared storage environment,
this directory is unique across all compute
nodes.</para>
<para>
<code>/var/lib/nova/instances</code> contains two
types of directories.</para>
<para>The first is the <code>_base</code> directory. This
contains all of the cached base images from glance for
each unique image that has been launched on that
compute node. Files ending in <code>_20</code> (or a
different number) are the ephemeral base
images.</para>
<para>The other directories are titled
<code>instance-xxxxxxxx</code>. These directories
correspond to instances running on that compute node.
The files inside are related to one of the files in
the <code>_base</code> directory. They're essentially
differential-based files containing only the changes
made from the original <code>_base</code>
directory.</para>
<para>All files and directories in
<code>/var/lib/nova/instances</code> are uniquely
named. The files in <code>_base</code> are uniquely named
for the glance image that they are based on, and the
directory names <code>instance-xxxxxxxx</code> are uniquely
named for that particular instance. For example, if
you copy all data from
<code>/var/lib/nova/instances</code> on one
compute node to another, you do not overwrite any
files or cause any damage to images that have the same
unique name, because they are essentially the same
file.</para>
<para>Although this method is not documented or supported,
you can use it when your compute node is permanently
offline but you have instances locally stored on
it.</para>
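Before copying data off a dying node, it helps to enumerate exactly which instance directories exist. A hypothetical sketch; the rsync invocation in the usage note is an illustration, not a documented procedure:

```shell
#!/bin/sh
# Hypothetical sketch: list the per-instance directories under a
# /var/lib/nova/instances-style tree, so you know what you are about
# to copy off a failed node.
instance_dirs() {
    # Print instance-* directory names under $1, one per line.
    for d in "$1"/instance-*; do
        [ -d "$d" ] && basename "$d"
    done
}

# Example:
# instance_dirs /var/lib/nova/instances
# rsync -a /var/lib/nova/instances/ newnode:/var/lib/nova/instances/
```

Because the names are unique, as noted above, copying the whole tree onto another node cannot silently overwrite a different instance's data.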
</section>
</section>
<section xml:id="storage_node_failures">
<?dbhtml stop-chunking?>
<title>Storage Node Failures and Maintenance</title>
<para>Due to Object Storage's high redundancy, dealing
with object storage node issues is much easier than
dealing with compute node issues.</para>
<section xml:id="reboot_storage_node">
<?dbhtml stop-chunking?>
<title>Rebooting a Storage Node</title>
<para>If a storage node requires a reboot, simply reboot
it. Requests for data hosted on that node are
redirected to other copies while the server is
rebooting.</para>
</section>
<section xml:id="shut_down_storage_node">
<?dbhtml stop-chunking?>
<title>Shutting Down a Storage Node</title>
<para>If you need to shut down a storage node for an
extended period of time (1+ days), consider removing
the node from the storage ring. For example:</para>
<programlisting><?db-font-size 65%?># swift-ring-builder account.builder remove &lt;ip address of storage node&gt;
# swift-ring-builder container.builder remove &lt;ip address of storage node&gt;
# swift-ring-builder object.builder remove &lt;ip address of storage node&gt;
# swift-ring-builder account.builder rebalance
# swift-ring-builder container.builder rebalance
# swift-ring-builder object.builder rebalance</programlisting>
<para>Next, redistribute the ring files to the other
nodes:</para>
<programlisting><?db-font-size 65%?># for i in s01.example.com s02.example.com s03.example.com
&gt; do
&gt; scp *.ring.gz $i:/etc/swift
&gt; done</programlisting>
<para>These actions effectively take the storage node out
of the storage cluster.</para>
<para>When the node is able to rejoin the cluster, just
add it back to the ring. The exact syntax for adding a
node to your Swift cluster with
<code>swift-ring-builder</code> depends heavily on
the options used when you originally created
your cluster. Refer back to those
commands.</para>
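Because the same three builder files need the same two operations, the removal sequence is easy to generate and review. A hypothetical sketch; emitting the commands as text first is a stylistic choice, not a Swift requirement:

```shell
#!/bin/sh
# Hypothetical sketch: emit the swift-ring-builder commands that remove
# a storage node and rebalance all three rings, for review or piping
# to `sh`.
ring_remove() {
    ip=$1
    for builder in account container object; do
        echo "swift-ring-builder $builder.builder remove $ip"
    done
    for builder in account container object; do
        echo "swift-ring-builder $builder.builder rebalance"
    done
}

# Review, execute, then redistribute the ring files as shown above:
# ring_remove 10.0.0.12
# ring_remove 10.0.0.12 | sh
```

Running all removals before any rebalance, as the two loops do, avoids rebalancing three times.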
</section>
<section xml:id="replace_swift_disk">
<?dbhtml stop-chunking?>
<title>Replacing a Swift Disk</title>
<para>If a hard drive fails in an Object Storage node,
replacing it is relatively easy. This assumes that
your Object Storage environment is configured
correctly, so that the data stored on the failed drive
is also replicated to other drives in the Object
Storage environment.</para>
<para>This example assumes that <code>/dev/sdb</code> has
failed.</para>
<para>First, unmount the disk:</para>
<programlisting><?db-font-size 65%?># umount /dev/sdb</programlisting>
<para>Next, physically remove the disk from the server and
replace it with a working disk.</para>
<para>Ensure that the operating system has recognized the
new disk:</para>
<programlisting><?db-font-size 65%?># dmesg | tail</programlisting>
<para>You should see a message about /dev/sdb.</para>
<para>Because partitions are not recommended on a
swift disk, simply format the whole disk:</para>
<programlisting><?db-font-size 65%?># mkfs.xfs /dev/sdb</programlisting>
<para>Finally, mount the disk:</para>
<programlisting><?db-font-size 65%?># mount -a</programlisting>
<para>Swift should notice the new disk and that no data
exists. It then begins replicating the data to the
disk from the other existing replicas.</para>
</section>
</section>
<section xml:id="complete_failure">
<?dbhtml stop-chunking?>
<title>Handling a Complete Failure</title>
<para>A common way of dealing with recovery from a full
system failure, such as a power outage in a data center,
is to assign each service a priority and restore them in
order.</para>
<table rules="all">
<caption>Example Service Restoration Priority
List</caption>
<tbody>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>1</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Internal network
connectivity</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>2</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Backing storage
services</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>3</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Public network connectivity for
user Virtual Machines</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>4</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Nova-compute, nova-network, cinder
hosts</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>5</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>User virtual machines</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>10</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Message Queue and Database
services</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>15</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Keystone services</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>20</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>cinder-scheduler</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>21</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Image Catalogue and Delivery
services</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>22</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>nova-scheduler services</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>98</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Cinder-api</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>99</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Nova-api services</para></td>
</tr>
<tr>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>100</para></td>
<td xmlns:db="http://docbook.org/ns/docbook"
><para>Dashboard node</para></td>
</tr>
</tbody>
</table>
<para>Use this example priority list to ensure that
user-affected services are restored as soon as possible, but
not before a stable environment is in place. Of course,
despite being listed as a single line item, each step
requires significant work. For example, just after
starting the database, you should check its integrity, or,
after starting the nova services, you should verify that
the hypervisor matches the database and fix any
mismatches.</para>
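If you keep the priority list in a plain text file, the restore order can be driven from it mechanically. A hypothetical sketch; the action names are placeholders, not real commands:

```shell
#!/bin/sh
# Hypothetical sketch: read "priority action" lines and emit the actions
# in numeric priority order, mirroring the restoration table above.
restore_order() {
    sort -n | sed 's/^[0-9]* *//'
}

# Example:
# restore_order <<'EOF'
# 10 start-message-queue-and-database
# 1 check-internal-network
# 15 start-keystone
# EOF
```

Printing the ordered plan before acting on it gives you a checkpoint to insert the verification work each step requires.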
</section>
<section xml:id="config_mgmt">
<?dbhtml stop-chunking?>
<title>Configuration Management</title>
<para>Maintaining an OpenStack cloud requires that you manage
multiple physical servers, and this number might grow over
time. Because managing nodes manually is error-prone, we
strongly recommend that you use a configuration management
tool. These tools automate the process of ensuring that
all of your nodes are configured properly and encourage
you to maintain your configuration information (such as
packages and configuration options) in a version
controlled repository.</para>
<tip><para>Several configuration management tools are available,
and this guide does not recommend a specific one. The two
most popular ones in the OpenStack community are <link
xlink:href="https://puppetlabs.com/">Puppet</link>
(https://puppetlabs.com/) with available <link
xlink:title="OpenStack Puppet modules"
xlink:href="http://github.com/puppetlabs/puppetlabs-openstack"
>OpenStack Puppet modules</link>
(http://github.com/puppetlabs/puppetlabs-openstack) and
<link xlink:href="http://www.opscode.com/chef/"
>Chef</link> (http://www.opscode.com/chef/) with available
<link
xlink:href="https://github.com/opscode/openstack-chef-repo"
>OpenStack Chef recipes</link>
(https://github.com/opscode/openstack-chef-repo). Other
newer configuration tools include <link
xlink:href="https://juju.ubuntu.com/">Juju</link>
(https://juju.ubuntu.com/), <link
xlink:href="http://ansible.cc">Ansible</link>
(http://ansible.cc) and <link
xlink:href="http://saltstack.com/">Salt</link>
(http://saltstack.com), and more mature configuration
management tools include <link
xlink:href="http://cfengine.com/">CFEngine</link>
(http://cfengine.com) and <link
xlink:href="http://bcfg2.org/">Bcfg2</link>
(http://bcfg2.org).</para></tip>
</section>
<section xml:id="hardware">
<?dbhtml stop-chunking?>
<title>Working with Hardware</title>
<para>Similar to your initial deployment, you should ensure
all hardware is appropriately burned in before adding it
to production. Run software that uses the hardware to its
limits, maxing out RAM, CPU, disk, and network. Many
options are available, and they normally double as benchmark
software, so you also get a good idea of the performance of
your system.</para>
<section xml:id="add_new_node">
<?dbhtml stop-chunking?>
<title>Adding a Compute Node</title>
<para>If you find that you have reached or are reaching
the capacity limit of your computing resources, you
should plan to add additional compute nodes. Adding
more nodes is quite easy. The process for adding compute nodes
is the same as when the initial compute nodes were
deployed to your cloud: use an automated deployment
system to bootstrap the bare-metal server with the
operating system and then have a configuration
management system install and configure the OpenStack
Compute service. Once the Compute service has been
installed and configured in the same way as the other
compute nodes, it automatically attaches itself to the
cloud. The cloud controller notices the new node(s)
and begins scheduling instances to launch there.</para>
<para>If your OpenStack Block Storage nodes are separate
from your compute nodes, the same procedure still
applies as the same queuing and polling system is used
in both services.</para>
<para>We recommend that you use the same hardware for new
compute and block storage nodes. At the very least,
ensure that the CPUs in the compute nodes are similar
enough not to break live migration.</para>
</section>
<section xml:id="add_new_object_node">
<?dbhtml stop-chunking?>
<title>Adding an Object Storage Node</title>
<para>Adding a new object storage node is different than
adding compute or block storage nodes. You still want
to initially configure the server by using your
automated deployment and configuration management
systems. After that is done, you need to add the local
disks of the object storage node into the object
storage ring. The exact command to do this is the same
command that was used to add the initial disks to the
ring. Simply re-run this command on the object storage
proxy server for all disks on the new object storage
node. Once this has been done, rebalance the ring and
copy the resulting ring files to the other storage
nodes.</para>
<note>
<para>If your new object storage node has a different
number of disks than the original nodes have, the
command to add the new node is different than the
original commands. These parameters vary from
environment to environment.</para>
</note>
</section>
<section xml:id="replace_components">
<?dbhtml stop-chunking?>
<title>Replacing Components</title>
<para>Hardware failures are common in large-scale
deployments such as an infrastructure cloud. Consider
your processes and balance time saving against
availability. For example, an Object Storage cluster
can easily live with dead disks in it for some period
of time if it has sufficient capacity. Or, if your
compute installation is not full, you could consider
live-migrating instances off a host with a RAM failure
until you have time to deal with the problem.</para>
</section>
</section>
<section xml:id="databases">
<?dbhtml stop-chunking?>
<title>Databases</title>
<para>Almost all OpenStack components have an underlying
database to store persistent information. Usually this
database is MySQL. Normal MySQL administration is
applicable to these databases; OpenStack does not
configure the databases in any unusual way. Basic
administration includes performance tweaking, high
availability, backup, recovery, and repairing. For more
information, see a standard MySQL administration
guide.</para>
<para>You can perform a couple of tricks with the database to
either retrieve information more quickly or fix a data
inconsistency error, such as an instance that was
terminated but whose status was not updated in the database.
These tricks are discussed throughout this book.</para>
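<para>As one hedged example of such a trick, fixing the stuck
instance state might look like the following SQL run through the
<code>mysql</code> client. The column names match older nova schemas
and are assumptions; inspect the table and back up the database
before changing anything:</para>

```shell
# Assumption: the column names (uuid, vm_state, power_state) follow
# older nova schemas; verify against your release and back up the
# database before running any UPDATE.
mysql nova -e "SELECT uuid, vm_state, power_state FROM instances \
    WHERE uuid='INSTANCE_UUID';"
# If the instance really is gone, mark it as deleted:
mysql nova -e "UPDATE instances SET vm_state='deleted' \
    WHERE uuid='INSTANCE_UUID';"
```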
<section xml:id="database_connect">
<?dbhtml stop-chunking?>
<title>Database Connectivity</title>
<para>Review each component's configuration file to see how
the component accesses its corresponding
database. Look for either <code>sql_connection</code>
or simply <code>connection</code>. The following
command uses <code>grep</code> to display the SQL
connection string for nova, glance, cinder, and
keystone:</para>
<programlisting><?db-font-size 65%?># grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
/etc/cinder/cinder.conf /etc/keystone/keystone.conf
sql_connection = mysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
sql_connection = mysql://glance:password@cloud.example.com/glance
sql_connection = mysql://glance:password@cloud.example.com/glance
sql_connection = mysql://cinder:password@cloud.example.com/cinder
connection = mysql://keystone_admin:password@cloud.example.com/keystone</programlisting>
<para>The connection strings take this format:</para>
<programlisting><?db-font-size 65%?>mysql:// &lt;username&gt; : &lt;password&gt; @ &lt;hostname&gt; / &lt;database name&gt;</programlisting>
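<para>If you need the individual parts of a connection string in a
script, they can be split out with <code>sed</code>. A minimal sketch,
using a made-up connection string:</para>

```shell
# A made-up connection string, split into its four parts with sed.
conn="mysql://nova:secret@cloud.example.com/nova"

user=$(echo "$conn"     | sed -E 's|^mysql://([^:]+):.*|\1|')
password=$(echo "$conn" | sed -E 's|^mysql://[^:]+:([^@]+)@.*|\1|')
host=$(echo "$conn"     | sed -E 's|^mysql://[^@]+@([^/]+)/.*|\1|')
database=$(echo "$conn" | sed -E 's|.*/([^/]+)$|\1|')

echo "user=$user host=$host database=$database"
# prints: user=nova host=cloud.example.com database=nova
```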
</section>
<section xml:id="perf_and_opt">
<?dbhtml stop-chunking?>
<title>Performance and Optimizing</title>
<para>As your cloud grows, MySQL is used more and more
heavily. If you suspect that MySQL might be becoming a
bottleneck, you should start researching MySQL
optimization. The MySQL manual has an entire section
dedicated to this topic: <link
xlink:href="http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html"
>Optimization Overview</link>
(http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html).</para>
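<para>A common first step is enabling the slow query log so that
expensive queries surface. A minimal <code>my.cnf</code> fragment;
the values shown are starting points only, not recommendations:</para>

```ini
# Log statements that take longer than two seconds; tune to taste.
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2
```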
</section>
</section>
<section xml:id="hdmy">
<?dbhtml stop-chunking?>
<title>HDWMY</title>
<para>Here's a quick list of various to-do items for each
hour, day, week, month, and year. Please note that these
tasks are neither required nor definitive but are
helpful ideas:</para>
<section xml:id="hourly">
<?dbhtml stop-chunking?>
<title>Hourly</title>
<itemizedlist>
<listitem>
<para>Check your monitoring system for alerts and
act on them.</para>
</listitem>
<listitem>
<para>Check your ticket queue for new
tickets.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="daily">
<?dbhtml stop-chunking?>
<title>Daily</title>
<itemizedlist>
<listitem>
<para>Check for instances in a failed or weird
state and investigate why.</para>
</listitem>
<listitem>
<para>Check for security patches and apply them as
needed.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="weekly">
<?dbhtml stop-chunking?>
<title>Weekly</title>
<itemizedlist>
<listitem>
<para>Check cloud usage: <itemizedlist>
<listitem>
<para>User quotas</para>
</listitem>
<listitem>
<para>Disk space</para>
</listitem>
<listitem>
<para>Image usage</para>
</listitem>
<listitem>
<para>Large instances</para>
</listitem>
<listitem>
<para>Network usage (bandwidth and IP
usage)</para>
</listitem>
</itemizedlist></para>
</listitem>
<listitem>
<para>Verify your alert mechanisms are still
working.</para>
</listitem>
</itemizedlist>
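<para>The disk space check above can be scripted. A minimal sketch
that warns when any mounted filesystem passes an arbitrary 90 percent
threshold:</para>

```shell
# Warn when any mounted filesystem is 90% full or more.
# The threshold is an arbitrary starting point; tune it for your cloud.
THRESHOLD=90
df -P | awk -v limit="$THRESHOLD" 'NR > 1 {
    used = $5
    sub(/%/, "", used)
    if (used + 0 >= limit)
        print "WARNING: " $6 " is " used "% full"
}'
```

<para>Run from cron on storage and compute nodes, this gives a weekly
(or more frequent) early warning before disks fill up.</para>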
</section>
<section xml:id="monthly">
<?dbhtml stop-chunking?>
<title>Monthly</title>
<itemizedlist>
<listitem>
<para>Check usage and trends over the past
month.</para>
</listitem>
<listitem>
<para>Check for user accounts that should be
removed.</para>
</listitem>
<listitem>
<para>Check for operator accounts that should be
removed.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="quarterly">
<?dbhtml stop-chunking?>
<title>Quarterly</title>
<itemizedlist>
<listitem>
<para>Review usage and trends over the past
quarter.</para>
</listitem>
<listitem>
<para>Prepare any quarterly reports on usage and
statistics.</para>
</listitem>
<listitem>
<para>Review and plan any necessary cloud
additions.</para>
</listitem>
<listitem>
<para>Review and plan any major OpenStack
upgrades.</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="semiannual">
<?dbhtml stop-chunking?>
<title>Semi-Annually</title>
<itemizedlist>
<listitem>
<para>Upgrade OpenStack.</para>
</listitem>
<listitem>
<para>Clean up after OpenStack upgrade (any unused
or new services to be aware of?)</para>
</listitem>
</itemizedlist>
</section>
</section>
<section xml:id="broken_component">
<?dbhtml stop-chunking?>
<title>Determining which Component Is Broken</title>
<para>OpenStack is a collection of components that interact
with each other strongly. For example, uploading an image
requires interaction between <code>nova-api</code>,
<code>glance-api</code>, <code>glance-registry</code>,
Keystone, and potentially <code>swift-proxy</code>. As a
result, it is sometimes difficult to determine exactly
where problems lie. Assisting in this is the purpose of
this section.</para>
<section xml:id="tailing_logs">
<?dbhtml stop-chunking?>
<title>Tailing Logs</title>
<para>The first place to look is the log file related to
the command you are trying to run. For example, if
<code>nova list</code> is failing, try tailing a
Nova log file and running the command again:</para>
<para>Terminal 1:</para>
<programlisting><?db-font-size 65%?># tail -f /var/log/nova/nova-api.log</programlisting>
<para>Terminal 2:</para>
<programlisting><?db-font-size 65%?># nova list</programlisting>
<para>Look for any errors or traces in the log file. For
more information, see the chapter on <emphasis
role="bold">Logging and
Monitoring</emphasis>.</para>
<para>If the error indicates that the problem is with
another component, switch to tailing that component's
log file. For example, if nova cannot access glance,
look at the glance-api log:</para>
<para>Terminal 1:</para>
<programlisting><?db-font-size 65%?># tail -f /var/log/glance/api.log</programlisting>
<para>Terminal 2:</para>
<programlisting><?db-font-size 65%?># nova list</programlisting>
<para>Wash, rinse, repeat until you find the core cause of
the problem.</para>
</section>
<section xml:id="daemons_cli">
<?dbhtml stop-chunking?>
<title>Running Daemons on the CLI</title>
<para>Unfortunately, sometimes the error is not apparent
from the log files. In this case, switch tactics and
use a different command, such as running the service
directly on the command line. For example, if the
<code>glance-api</code> service refuses to start
and stay running, try launching the daemon from the
command line:</para>
<programlisting><?db-font-size 65%?># sudo -u glance -H glance-api</programlisting>
<para>This might print the error and cause of the problem.<note>
<para>The <literal>-H</literal> flag is required
when running the daemons with sudo because
some daemons will write files relative to the
user's home directory, and this write may fail
if <literal>-H</literal> is left off.</para>
</note></para>
<sidebar>
<title>Example of Complexity</title>
<para>One morning, a compute node failed to run any
instances. The log files were a bit vague, claiming
that a certain instance could not be started. This
ended up being a red herring because the instance was
simply the first instance in alphabetical order, so it
was the first instance that nova-compute would touch.</para>
<para>Further troubleshooting showed that libvirt was not
running at all. This made more sense. If libvirt
wasn't running, then no instance could be virtualized
through KVM. Upon trying to start libvirt, it would
silently die immediately. The libvirt logs did not
explain why.</para>
<para>Next, the <code>libvirtd</code> daemon was run on
the command line. Finally, a helpful error message: it
could not connect to D-Bus. As ridiculous as it
sounds, libvirt, and thus <code>nova-compute</code>,
relies on D-Bus, and somehow D-Bus had crashed. Simply
starting D-Bus set the entire chain back on track, and
soon everything was back up and running.</para>
</sidebar>
</section>
</section>
<section xml:id="uninstalling">
<?dbhtml stop-chunking?>
<title>Uninstalling</title>
<para>While we'd always recommend using your automated
deployment system to re-install systems from scratch,
sometimes you do need to remove OpenStack from a system
the hard way. Here's how:</para>
<itemizedlist>
<listitem><para>Remove all packages</para></listitem>
<listitem><para>Remove remaining files</para></listitem>
<listitem><para>Remove databases</para></listitem>
</itemizedlist>
<para>These steps depend on your underlying distribution,
but in general you should be looking for 'purge' commands
in your package manager, like <literal>aptitude purge ~c $package</literal>.
Following this, you can look for orphaned files in the
directories referenced throughout this guide. To uninstall
the database properly, refer to the manual for
the database product in use.</para>
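<para>On a Debian or Ubuntu system, the three steps above might look
like the following. The package and directory names are examples
only; confirm what is actually installed before purging anything, and
back up the databases before dropping them:</para>

```shell
# Example only: find the OpenStack packages actually installed here.
dpkg -l | awk '/nova|glance|keystone|cinder/ {print $2}'
# Purge the packages found above (purge removes configuration too):
apt-get -y purge nova-api nova-compute glance-api keystone
# Remove remaining configuration and state files:
rm -rf /etc/nova /etc/glance /var/lib/nova /var/lib/glance
# Drop the databases (back them up first if in any doubt):
mysql -e "DROP DATABASE nova; DROP DATABASE glance; DROP DATABASE keystone;"
```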
</section>
</chapter>