<?xml version="1.0" encoding="UTF-8"?>
<section xml:id="section_compute-system-admin"
xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0">
<title>System administration</title>
    <para>By understanding how the different installed nodes
        interact with each other, you can administer the Compute
        installation. Compute can be deployed in many ways across
        multiple servers, but the general idea is that multiple
        compute nodes run the virtual servers and a cloud
        controller node runs the remaining
        Compute services.</para>
<para>The Compute cloud works through the interaction of a series of daemon processes named
<systemitem>nova-*</systemitem> that reside persistently on the host machine or
        machines. These binaries can all run on the same machine or be spread across multiple
        machines in a large deployment. The responsibilities of services and drivers are:</para>
<para>
<itemizedlist>
<listitem>
<para>Services:</para>
<itemizedlist>
<listitem>
                    <para><systemitem class="service">nova-api</systemitem>. Receives XML
                        requests and sends them to the rest of the system. It is a WSGI application
                        that routes and authenticates requests. It supports the EC2 and OpenStack
                        APIs. A <filename>nova-api.conf</filename> file is created when
                        you install Compute.</para>
</listitem>
<listitem>
<para><systemitem>nova-cert</systemitem>. Provides the certificate
manager.</para>
</listitem>
<listitem>
<para><systemitem class="service">nova-compute</systemitem>. Responsible for
managing virtual machines. It loads a Service object which exposes the
public methods on ComputeManager through Remote Procedure Call
(RPC).</para>
</listitem>
<listitem>
<para><systemitem>nova-conductor</systemitem>. Provides database-access
support for Compute nodes (thereby reducing security risks).</para>
</listitem>
<listitem>
<para><systemitem>nova-consoleauth</systemitem>. Handles console
authentication.</para>
</listitem>
<listitem>
<para><systemitem class="service">nova-objectstore</systemitem>: The
<systemitem class="service">nova-objectstore</systemitem> service is
an ultra simple file-based storage system for images that replicates
most of the S3 API. It can be replaced with OpenStack Image Service and
a simple image manager or use OpenStack Object Storage as the virtual
machine image storage facility. It must reside on the same node as
<systemitem class="service">nova-compute</systemitem>.</para>
</listitem>
<listitem>
<para><systemitem class="service">nova-network</systemitem>. Responsible for
managing floating and fixed IPs, DHCP, bridging and VLANs. It loads a
Service object which exposes the public methods on one of the subclasses
of NetworkManager. Different networking strategies are available to the
service by changing the network_manager configuration option to
FlatManager, FlatDHCPManager, or VlanManager (default is VLAN if no
other is specified).</para>
</listitem>
<listitem>
<para><systemitem>nova-scheduler</systemitem>. Dispatches requests for
new virtual machines to the correct node.</para>
</listitem>
<listitem>
<para><systemitem>nova-novncproxy</systemitem>. Provides a VNC proxy for
browsers (enabling VNC consoles to access virtual machines).</para>
</listitem>
</itemizedlist>
</listitem>
<listitem>
<para>Some services have drivers that change how the service implements the core of
its functionality. For example, the <systemitem>nova-compute</systemitem>
                service supports drivers that let you choose which hypervisor type it
                talks to. <systemitem>nova-network</systemitem> and
<systemitem>nova-scheduler</systemitem> also have drivers.</para>
</listitem>
</itemizedlist>
</para>
<section xml:id="section_compute-service-arch">
<title>Compute service architecture</title>
        <para>The following basic categories describe the service architecture and what happens
            within the cloud controller.</para>
<simplesect>
<title>API server</title>
<para>At the heart of the cloud framework is an API server. This API server makes
command and control of the hypervisor, storage, and networking programmatically
available to users.</para>
<para>The API endpoints are basic HTTP web services
which handle authentication, authorization, and
basic command and control functions using various
API interfaces under the Amazon, Rackspace, and
related models. This enables API compatibility
with multiple existing tool sets created for
interaction with offerings from other vendors.
This broad compatibility prevents vendor
lock-in.</para>
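            <para>For example, once you have a token from the Identity Service, you can talk to the
                Compute API endpoint directly with any HTTP client. This is only a sketch: it
                assumes a token in <literal>$OS_TOKEN</literal>, your tenant ID in
                <literal>$OS_TENANT_ID</literal>, and the Compute endpoint at
                <literal>controller:8774</literal>.</para>
            <screen><prompt>$</prompt> <userinput>curl -s -H "X-Auth-Token: $OS_TOKEN" http://controller:8774/v2/$OS_TENANT_ID/servers</userinput></screen>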
</simplesect>
<simplesect>
<title>Message queue</title>
<para>A messaging queue brokers the interaction
between compute nodes (processing), the networking
controllers (software which controls network
infrastructure), API endpoints, the scheduler
(determines which physical hardware to allocate to
a virtual resource), and similar components.
Communication to and from the cloud controller is
by HTTP requests through multiple API
endpoints.</para>
<para>A typical message passing event begins with the API server receiving a request
from a user. The API server authenticates the user and ensures that the user is
permitted to issue the subject command. The availability of objects implicated in
the request is evaluated and, if available, the request is routed to the queuing
engine for the relevant workers. Workers continually listen to the queue based on
their role, and occasionally their type host name. When an applicable work request
arrives on the queue, the worker takes assignment of the task and begins its
execution. Upon completion, a response is dispatched to the queue which is received
by the API server and relayed to the originating user. Database entries are queried,
added, or removed as necessary throughout the process.</para>
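            <para>If RabbitMQ is your message broker, you can watch this traffic from the cloud
                controller by listing the queues that the <systemitem>nova-*</systemitem> services
                have declared (a quick sanity check rather than a required step):</para>
            <screen><prompt>#</prompt> <userinput>rabbitmqctl list_queues</userinput></screen>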
</simplesect>
<simplesect>
<title>Compute worker</title>
<para>Compute workers manage computing instances on
host machines. The API dispatches commands to
compute workers to complete these tasks:</para>
<itemizedlist>
<listitem>
<para>Run instances</para>
</listitem>
<listitem>
<para>Terminate instances</para>
</listitem>
<listitem>
<para>Reboot instances</para>
</listitem>
<listitem>
<para>Attach volumes</para>
</listitem>
<listitem>
<para>Detach volumes</para>
</listitem>
<listitem>
<para>Get console output</para>
</listitem>
</itemizedlist>
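            <para>These tasks map to <command>nova</command> client commands that end up on a
                compute worker. For example (a sketch; replace
                <replaceable>SERVER</replaceable>, <replaceable>VOLUME</replaceable>, and
                <replaceable>DEVICE</replaceable> with real values):</para>
            <screen><prompt>$</prompt> <userinput>nova reboot <replaceable>SERVER</replaceable></userinput>
<prompt>$</prompt> <userinput>nova volume-attach <replaceable>SERVER</replaceable> <replaceable>VOLUME</replaceable> <replaceable>DEVICE</replaceable></userinput>
<prompt>$</prompt> <userinput>nova console-log <replaceable>SERVER</replaceable></userinput></screen>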
</simplesect>
<simplesect>
            <title>Network controller</title>
<para>The Network Controller manages the networking
resources on host machines. The API server
dispatches commands through the message queue,
which are subsequently processed by Network
Controllers. Specific operations include:</para>
<itemizedlist>
<listitem>
<para>Allocate fixed IP addresses</para>
</listitem>
                <listitem>
                    <para>Configure VLANs for projects</para>
                </listitem>
                <listitem>
                    <para>Configure networks for compute
                        nodes</para>
                </listitem>
</itemizedlist>
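            <para>To see the networks that <systemitem class="service">nova-network</systemitem>
                currently manages, you can run (a quick check, assuming admin credentials):</para>
            <screen><prompt>$</prompt> <userinput>nova network-list</userinput></screen>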
</simplesect>
</section>
<section xml:id="section_manage-compute-users">
<title>Manage Compute users</title>
        <para>Access to the Euca2ools (EC2) API is controlled by
            an access key and a secret key. The user's access key must
            be included in the request, and the request must be
            signed with the secret key. Upon receipt of API
            requests, Compute verifies the signature and runs
            commands on behalf of the user.</para>
<para>To begin using Compute, you must create a user with
the Identity Service.</para>
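        <para>For example, a hedged sketch with the legacy <command>keystone</command> client
            (names and IDs are placeholders; check <command>keystone help</command> for the exact
            arguments in your release): create the user, and then generate EC2 access and secret
            keys for it.</para>
        <screen><prompt>$</prompt> <userinput>keystone user-create --name alice --pass <replaceable>PASSWORD</replaceable></userinput>
<prompt>$</prompt> <userinput>keystone ec2-credentials-create --user-id <replaceable>USER_ID</replaceable> --tenant-id <replaceable>TENANT_ID</replaceable></userinput></screen>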
</section>
<section xml:id="section_manage-the-cloud">
<title>Manage the cloud</title>
<para>A system administrator can use the <command>nova</command> client and the
<command>Euca2ools</command> commands to manage the cloud.</para>
        <para>Both the nova client and euca2ools can be used by all users, though specific commands
            might be restricted by Role Based Access Control in the Identity Service.</para>
<procedure>
<title>To use the nova client</title>
<step>
                <para>Installing the <package>python-novaclient</package> package gives you a
                    <code>nova</code> shell command that enables Compute API interactions from
                    the command line. Install the client and then provide your user name and
                    password (typically set as environment variables for convenience), and you
                    can send commands to your cloud from the command line.</para>
<para>To install <package>python-novaclient</package>, download the tarball from
<link
xlink:href="http://pypi.python.org/pypi/python-novaclient/2.6.3#downloads"
>http://pypi.python.org/pypi/python-novaclient/2.6.3#downloads</link> and
then install it in your favorite python environment.</para>
<screen><prompt>$</prompt> <userinput>curl -O http://pypi.python.org/packages/source/p/python-novaclient/python-novaclient-2.6.3.tar.gz</userinput>
<prompt>$</prompt> <userinput>tar -zxvf python-novaclient-2.6.3.tar.gz</userinput>
<prompt>$</prompt> <userinput>cd python-novaclient-2.6.3</userinput></screen>
<para>As <systemitem class="username">root</systemitem> execute:</para>
<screen><prompt>#</prompt> <userinput>python setup.py install</userinput></screen>
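                <para>Alternatively, you can usually install the client with
                    <command>pip</command> (assuming <command>pip</command> is available on your
                    system):</para>
                <screen><prompt>#</prompt> <userinput>pip install python-novaclient</userinput></screen>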
</step>
<step>
<para>Confirm the installation by running:</para>
<screen><prompt>$</prompt> <userinput>nova help</userinput>
<computeroutput>usage: nova [--version] [--debug] [--os-cache] [--timings]
[--timeout &lt;seconds&gt;] [--os-username &lt;auth-user-name&gt;]
[--os-password &lt;auth-password&gt;]
[--os-tenant-name &lt;auth-tenant-name&gt;]
[--os-tenant-id &lt;auth-tenant-id&gt;] [--os-auth-url &lt;auth-url&gt;]
[--os-region-name &lt;region-name&gt;] [--os-auth-system &lt;auth-system&gt;]
[--service-type &lt;service-type&gt;] [--service-name &lt;service-name&gt;]
[--volume-service-name &lt;volume-service-name&gt;]
[--endpoint-type &lt;endpoint-type&gt;]
[--os-compute-api-version &lt;compute-api-ver&gt;]
[--os-cacert &lt;ca-certificate&gt;] [--insecure]
[--bypass-url &lt;bypass-url&gt;]
&lt;subcommand&gt; ...</computeroutput></screen>
<note><para>This command returns a list of <command>nova</command> commands and parameters. To obtain help
for a subcommand, run:</para>
<screen><prompt>$</prompt> <userinput>nova help <replaceable>subcommand</replaceable></userinput></screen>
<para>You can also refer to the <link
xlink:href="http://docs.openstack.org/cli-reference/content/">
<citetitle>OpenStack Command-Line Reference</citetitle></link>
for a complete listing of <command>nova</command>
commands and parameters.</para></note>
</step>
<step>
<para>Set the required parameters as environment variables to make running
commands easier. For example, you can add <parameter>--os-username</parameter>
as a <command>nova</command> option, or set it as an environment variable. To
set the user name, password, and tenant as environment variables, use:</para>
<screen><prompt>$</prompt> <userinput>export OS_USERNAME=joecool</userinput>
<prompt>$</prompt> <userinput>export OS_PASSWORD=coolword</userinput>
<prompt>$</prompt> <userinput>export OS_TENANT_NAME=coolu</userinput> </screen>
</step>
<step>
<para>Using the Identity Service, you are supplied with an authentication
endpoint, which Compute recognizes as the <literal>OS_AUTH_URL</literal>.</para>
<para>
<screen><prompt>$</prompt> <userinput>export OS_AUTH_URL=http://hostname:5000/v2.0</userinput>
<prompt>$</prompt> <userinput>export NOVA_VERSION=1.1</userinput></screen>
</para>
</step>
</procedure>
<simplesect>
<title>Use the euca2ools commands</title>
<para>For a command-line interface to EC2 API calls, use the
<command>euca2ools</command> command-line tool. See <link
xlink:href="http://open.eucalyptus.com/wiki/Euca2oolsGuide_v1.3"
>http://open.eucalyptus.com/wiki/Euca2oolsGuide_v1.3</link></para>
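            <para>For example, assuming your EC2 credentials (<literal>EC2_ACCESS_KEY</literal>,
                <literal>EC2_SECRET_KEY</literal>, and <literal>EC2_URL</literal>) are set in the
                environment, a typical call to list your instances is:</para>
            <screen><prompt>$</prompt> <userinput>euca-describe-instances</userinput></screen>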
</simplesect>
</section>
<xi:include
href="../../common/section_cli_nova_usage_statistics.xml"/>
<section xml:id="section_manage-logs">
<title>Manage logs</title>
<simplesect>
<title>Logging module</title>
            <para>To specify a configuration file that changes the logging behavior (for example, to
                change the logging level to <literal>DEBUG</literal>, <literal>INFO</literal>,
                <literal>WARNING</literal>, or <literal>ERROR</literal>), add this line to the
                <filename>/etc/nova/nova.conf</filename> file:
                <programlisting language="ini">log-config=/etc/nova/logging.conf</programlisting></para>
<para>The logging configuration file is an ini-style configuration file, which must
contain a section called <literal>logger_nova</literal>, which controls the behavior
of the logging facility in the <literal>nova-*</literal> services. For
example:<programlisting language="ini">[logger_nova]
level = INFO
handlers = stderr
qualname = nova</programlisting></para>
            <para>This example sets the debugging level to <literal>INFO</literal> (which is less
                verbose than the default <literal>DEBUG</literal> setting). <itemizedlist>
                    <listitem>
                        <para>For more details on the logging configuration syntax, including the
                            meaning of the <literal>handlers</literal> and
                            <literal>qualname</literal> variables, see the <link
                                xlink:href="http://docs.python.org/release/2.7/library/logging.html#configuration-file-format"
                                >Python documentation on the logging configuration file
                            format</link>.</para>
</listitem>
<listitem>
<para>For an example <filename>logging.conf</filename> file with various
defined handlers, see the
<link xlink:href="http://docs.openstack.org/trunk/config-reference/content/">
<citetitle>OpenStack Configuration Reference</citetitle></link>.</para>
</listitem>
</itemizedlist>
</para>
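            <para>The <literal>handlers = stderr</literal> entry refers to handler and formatter
                sections that must also be present in the same file. As an illustration only (a
                minimal sketch that follows the standard Python logging file format; adjust the
                handler and format to your needs), a complete <filename>logging.conf</filename>
                might look like this:</para>
            <programlisting language="ini">[loggers]
keys = root, nova

[handlers]
keys = stderr

[formatters]
keys = default

[logger_root]
level = WARNING
handlers = stderr

[logger_nova]
level = INFO
handlers = stderr
qualname = nova

[handler_stderr]
class = StreamHandler
args = (sys.stderr,)
formatter = default

[formatter_default]
format = %(asctime)s %(levelname)s %(name)s %(message)s</programlisting>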
</simplesect>
<simplesect>
<title>Syslog</title>
<para>You can configure OpenStack Compute services to send logging information to
<systemitem>syslog</systemitem>. This is useful if you want to use
<systemitem>rsyslog</systemitem>, which forwards the logs to a remote machine.
You need to separately configure the Compute service (nova), the Identity service
(keystone), the Image Service (glance), and, if you are using it, the Block Storage
service (cinder) to send log messages to <systemitem>syslog</systemitem>. To do so,
add the following lines to:</para>
<itemizedlist>
<listitem>
<para><filename>/etc/nova/nova.conf</filename></para>
</listitem>
<listitem>
<para><filename>/etc/keystone/keystone.conf</filename></para>
</listitem>
<listitem>
<para><filename>/etc/glance/glance-api.conf</filename></para>
</listitem>
<listitem>
<para><filename>/etc/glance/glance-registry.conf</filename></para>
</listitem>
<listitem>
<para><filename>/etc/cinder/cinder.conf</filename></para>
</listitem>
</itemizedlist>
<programlisting language="ini">verbose = False
debug = False
use_syslog = True
syslog_log_facility = LOG_LOCAL0</programlisting>
            <para>In addition to enabling <systemitem>syslog</systemitem>, these settings also
                disable the more verbose output and debugging output from the log.<note>
                    <para>Although the example above uses the same local facility for each service
                            (<literal>LOG_LOCAL0</literal>, which corresponds to
                            <systemitem>syslog</systemitem> facility <literal>LOCAL0</literal>), we
                        recommend that you configure a separate local facility for each service, as
                        this provides better isolation and more flexibility. For example, you might
                        want to capture logging information at different severity levels for
                        different services. <systemitem>syslog</systemitem> allows you to define up
                        to eight local facilities, <literal>LOCAL0, LOCAL1, ..., LOCAL7</literal>.
                        For more details, see the <systemitem>syslog</systemitem>
                        documentation.</para>
                </note></para>
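            <para>For example, you might assign one facility per service. The facility names below
                are arbitrary choices, not requirements; the point is only that each service gets
                its own value for <literal>syslog_log_facility</literal>:</para>
            <programlisting language="ini"># /etc/nova/nova.conf
use_syslog = True
syslog_log_facility = LOG_LOCAL0

# /etc/keystone/keystone.conf
use_syslog = True
syslog_log_facility = LOG_LOCAL1

# /etc/glance/glance-api.conf and /etc/glance/glance-registry.conf
use_syslog = True
syslog_log_facility = LOG_LOCAL2

# /etc/cinder/cinder.conf
use_syslog = True
syslog_log_facility = LOG_LOCAL3</programlisting>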
</simplesect>
<simplesect>
<title>Rsyslog</title>
<para><systemitem>rsyslog</systemitem> is a useful tool for setting up a centralized
log server across multiple machines. We briefly describe the configuration to set up
an <systemitem>rsyslog</systemitem> server; a full treatment of
                <systemitem>rsyslog</systemitem> is beyond the scope of this document. We assume
                <systemitem>rsyslog</systemitem> is already installed on your hosts
                (it is installed by default on most Linux distributions).</para>
<para>This example provides a minimal configuration for
<filename>/etc/rsyslog.conf</filename> on the log server host, which receives
the log files:</para>
<programlisting language="bash"># provides TCP syslog reception
$ModLoad imtcp
$InputTCPServerRun 1024</programlisting>
<para>Add a filter rule to <filename>/etc/rsyslog.conf</filename> which looks for a
host name. The example below uses <replaceable>compute-01</replaceable> as an
example of a compute host name:</para>
<programlisting language="bash">:hostname, isequal, "<replaceable>compute-01</replaceable>" /mnt/rsyslog/logs/compute-01.log</programlisting>
<para>On each compute host, create a file named
<filename>/etc/rsyslog.d/60-nova.conf</filename>, with the following
content:</para>
<programlisting language="bash"># prevent debug from dnsmasq with the daemon.none parameter
*.*;auth,authpriv.none,daemon.none,local0.none -/var/log/syslog
# Specify a log level of ERROR
local0.error @@172.20.1.43:1024</programlisting>
<para>Once you have created this file, restart your <systemitem>rsyslog</systemitem>
daemon. Error-level log messages on the compute hosts should now be sent to your log
server.</para>
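            <para>For example, on distributions that use the <systemitem>service</systemitem>
                command (an assumption; adjust to your init system):</para>
            <screen><prompt>#</prompt> <userinput>service rsyslog restart</userinput></screen>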
</simplesect>
</section>
<xi:include href="section_compute-rootwrap.xml"/>
<xi:include href="section_compute-configure-migrations.xml"/>
<section xml:id="section_live-migration-usage">
<title>Migrate instances</title>
<para>Before starting migrations, review the <link linkend="section_configuring-compute-migrations">Configure migrations section</link>.</para>
<para>Migration provides a scheme to migrate running
instances from one OpenStack Compute server to another
OpenStack Compute server.</para>
<procedure>
<title>To migrate instances</title>
<step>
                <para>List the running instances to get the ID
                    of the instance that you want to migrate:</para>
<screen><prompt>$</prompt> <userinput>nova list</userinput>
<computeroutput><![CDATA[+--------------------------------------+------+--------+-----------------+
| ID | Name | Status |Networks |
+--------------------------------------+------+--------+-----------------+
| d1df1b5a-70c4-4fed-98b7-423362f2c47c | vm1 | ACTIVE | private=a.b.c.d |
| d693db9e-a7cf-45ef-a7c9-b3ecb5f22645 | vm2 | ACTIVE | private=e.f.g.h |
+--------------------------------------+------+--------+-----------------+]]></computeroutput></screen>
</step>
<step>
                <para>Look at the information associated with that instance. This example uses
                    <literal>vm1</literal> from the previous step:</para>
<screen><prompt>$</prompt> <userinput>nova show d1df1b5a-70c4-4fed-98b7-423362f2c47c</userinput>
<computeroutput><![CDATA[+-------------------------------------+----------------------------------------------------------+
| Property | Value |
+-------------------------------------+----------------------------------------------------------+
...
| OS-EXT-SRV-ATTR:host | HostB |
...
| flavor | m1.tiny |
| id | d1df1b5a-70c4-4fed-98b7-423362f2c47c |
| name | vm1 |
| private network | a.b.c.d |
| status | ACTIVE |
...
+-------------------------------------+----------------------------------------------------------+]]></computeroutput></screen>
<para>In this example, vm1 is running on HostB.</para>
</step>
<step>
<para>Select the server to which instances will be migrated:</para>
<screen><prompt>#</prompt> <userinput>nova service-list</userinput>
<computeroutput>+------------------+------------+----------+---------+-------+----------------------------+-----------------+
| Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+------------------+------------+----------+---------+-------+----------------------------+-----------------+
| nova-consoleauth | HostA | internal | enabled | up | 2014-03-25T10:33:25.000000 | - |
| nova-scheduler | HostA | internal | enabled | up | 2014-03-25T10:33:25.000000 | - |
| nova-conductor | HostA | internal | enabled | up | 2014-03-25T10:33:27.000000 | - |
| nova-compute | HostB | nova | enabled | up | 2014-03-25T10:33:31.000000 | - |
| nova-compute | HostC | nova | enabled | up | 2014-03-25T10:33:31.000000 | - |
| nova-cert | HostA | internal | enabled | up | 2014-03-25T10:33:31.000000 | - |
+------------------+------------+----------+---------+-------+----------------------------+-----------------+</computeroutput>
</screen>
                <para>In this example, HostC can be selected as the target host
                    because <systemitem class="service">nova-compute</systemitem>
                    is running on it.</para>
</step>
<step>
<para>Ensure that HostC has enough resources for
migration.</para>
<screen><prompt>#</prompt> <userinput>nova host-describe HostC</userinput>
<computeroutput>+-----------+------------+-----+-----------+---------+
| HOST | PROJECT | cpu | memory_mb | disk_gb |
+-----------+------------+-----+-----------+---------+
| HostC | (total) | 16 | 32232 | 878 |
| HostC | (used_now) | 13 | 21284 | 442 |
| HostC | (used_max) | 13 | 21284 | 442 |
| HostC | p1 | 13 | 21284 | 442 |
| HostC | p2 | 13 | 21284 | 442 |
+-----------+------------+-----+-----------+---------+</computeroutput>
</screen>
                <itemizedlist>
                    <listitem>
                        <para><emphasis role="bold">cpu:</emphasis> number of
                            CPUs</para>
                    </listitem>
                    <listitem>
                        <para><emphasis role="bold">memory_mb:</emphasis> total amount of memory
                            (in MB)</para>
                    </listitem>
                    <listitem>
                        <para><emphasis role="bold">disk_gb:</emphasis> total amount of space for
                            NOVA-INST-DIR/instances (in GB)</para>
                    </listitem>
                    <listitem>
                        <para><emphasis role="bold">1st line:</emphasis> total amount of
                            resources for the physical server.</para>
                    </listitem>
                    <listitem>
                        <para><emphasis role="bold">2nd line:</emphasis> resources currently in
                            use.</para>
                    </listitem>
                    <listitem>
                        <para><emphasis role="bold">3rd line:</emphasis> maximum used
                            resources.</para>
                    </listitem>
                    <listitem>
                        <para><emphasis role="bold">4th line and below:</emphasis> resources used
                            by each project.</para>
                    </listitem>
                </itemizedlist>
</step>
<step>
<para>Use the <command>nova live-migration</command> command to migrate the
instances:<screen><prompt>$</prompt> <userinput>nova live-migration <replaceable>server</replaceable> <replaceable>host_name</replaceable> </userinput></screen></para>
<para>Where <replaceable>server</replaceable> can be either the server's ID or name.
For example:</para>
<screen><prompt>$</prompt> <userinput>nova live-migration d1df1b5a-70c4-4fed-98b7-423362f2c47c HostC</userinput><computeroutput>
<![CDATA[Migration of d1df1b5a-70c4-4fed-98b7-423362f2c47c initiated.]]></computeroutput></screen>
                <para>Ensure that the instances are migrated successfully, with <command>nova
                    list</command>. If instances are still running on HostB, check the log files
                    on the source and destination hosts (<systemitem class="service">nova-compute</systemitem> and <systemitem
                        class="service">nova-scheduler</systemitem>) to determine why. <note>
<para>Although the <command>nova</command> command is called
<command>live-migration</command>, under the default Compute
configuration options the instances are suspended before
migration.</para>
<para>For more details, see <link
xlink:href="http://docs.openstack.org/trunk/config-reference/content/configuring-openstack-compute-basics.html"
>Configure migrations</link> in <citetitle>OpenStack Configuration
Reference</citetitle>.</para>
</note>
</para>
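                <para>A quick way to confirm the new host is to check the
                    <literal>OS-EXT-SRV-ATTR:host</literal> field of <command>nova
                    show</command> (a simple sketch; the instance ID is the one used
                    above):</para>
                <screen><prompt>$</prompt> <userinput>nova show d1df1b5a-70c4-4fed-98b7-423362f2c47c | grep OS-EXT-SRV-ATTR:host</userinput></screen>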
</step>
</procedure>
</section>
<section xml:id="section_nova-compute-node-down">
<title>Recover from a failed compute node</title>
<para>If you have deployed Compute with a shared file
system, you can quickly recover from a failed compute
node. Of the two methods covered in these sections,
the evacuate API is the preferred method even in the
absence of shared storage. The evacuate API provides
many benefits over manual recovery, such as
re-attachment of volumes and floating IPs.</para>
<xi:include href="../../common/section_cli_nova_evacuate.xml"/>
<section xml:id="nova-compute-node-down-manual-recovery">
<title>Manual recovery</title>
<para>For KVM/libvirt compute node recovery, see the previous section. Use the
following procedure for all other hypervisors.</para>
<procedure>
<title>To work with host information</title>
<step>
<para>Identify the VMs on the affected hosts, using tools such as a
combination of <literal>nova list</literal> and <literal>nova show</literal>
                        or <literal>euca-describe-instances</literal>. Here is an example using the
                        EC2 API: instance i-000015b9 is running on node np-rcc54:
<programlisting language="bash">i-000015b9 at3-ui02 running nectarkey (376, np-rcc54) 0 m1.xxlarge 2012-06-19T00:48:11.000Z 115.146.93.60</programlisting>
</step>
<step>
                    <para>You can review the status of the host by using the Compute database.
                        Some of the important information is highlighted below. This example
                        converts an EC2 API instance ID into an OpenStack ID; if you used the
                        <literal>nova</literal> commands, you can substitute the ID directly.
                        You can find the credentials for your database in
                        <filename>/etc/nova/nova.conf</filename>.</para>
<programlisting language="bash">SELECT * FROM instances WHERE id = CONV('15b9', 16, 10) \G;
*************************** 1. row ***************************
created_at: 2012-06-19 00:48:11
updated_at: 2012-07-03 00:35:11
deleted_at: NULL
...
id: 5561
...
power_state: 5
vm_state: shutoff
...
hostname: at3-ui02
host: np-rcc54
...
uuid: 3f57699a-e773-4650-a443-b4b37eed5a06
...
task_state: NULL
...</programlisting>
</step>
</procedure>
<procedure>
<title>To recover the VM</title>
<step>
<para>When you know the status of the VM on the failed host, determine to
which compute host the affected VM should be moved. For example, run the
following database command to move the VM to np-rcc46:</para>
<programlisting language="bash">UPDATE instances SET host = 'np-rcc46' WHERE uuid = '3f57699a-e773-4650-a443-b4b37eed5a06'; </programlisting>
</step>
<step>
                    <para>If you use a hypervisor that relies on libvirt (such as KVM), it is a
                        good idea to update the <literal>libvirt.xml</literal> file (found in
                        <literal>/var/lib/nova/instances/[instance ID]</literal>). The important
                        changes to make are listed below (an illustrative fragment follows the list):</para>
<para>
<itemizedlist>
<listitem>
<para>Change the <literal>DHCPSERVER</literal> value to the host IP
address of the compute host that is now the VM's new
home.</para>
</listitem>
                            <listitem>
                                <para>Update the VNC IP to <literal>0.0.0.0</literal>, if it is
                                    not already set to that value.</para>
                            </listitem>
</itemizedlist>
</para>
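                    <para>As an illustration only (the exact layout of the generated
                        <literal>libvirt.xml</literal> varies by release and configuration), the
                        relevant fragments might look similar to this, with the surrounding
                        elements omitted:</para>
                    <programlisting language="xml">&lt;!-- inside the &lt;filterref&gt; of the instance's network interface --&gt;
&lt;parameter name="DHCPSERVER" value="<replaceable>new_compute_host_ip</replaceable>"/&gt;

&lt;!-- the VNC graphics element --&gt;
&lt;graphics type="vnc" autoport="yes" listen="0.0.0.0"/&gt;</programlisting>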
</step>
<step>
<para>Reboot the VM:</para>
<screen><prompt>$</prompt> <userinput>nova reboot --hard 3f57699a-e773-4650-a443-b4b37eed5a06</userinput></screen>
</step>
</procedure>
<para>In theory, the above database update and <literal>nova
reboot</literal> command are all that is required to recover a VM from a
failed host. However, if further problems occur, consider looking at
recreating the network filter configuration using <literal>virsh</literal>,
restarting the Compute services or updating the <literal>vm_state</literal>
and <literal>power_state</literal> in the Compute database.</para>
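            <para>As a last resort, and only if the instance is actually running on the new host,
                you can reset the stored state directly in the database. This is a hedged sketch;
                verify the values against your nova release
                (<literal>power_state</literal> <literal>1</literal> traditionally means
                <literal>RUNNING</literal>):</para>
            <programlisting language="bash">UPDATE instances SET vm_state = 'active', power_state = 1
    WHERE uuid = '3f57699a-e773-4650-a443-b4b37eed5a06';</programlisting>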
</section>
</section>
<section xml:id="section_nova-uid-mismatch">
<title>Recover from a UID/GID mismatch</title>
        <para>When running OpenStack Compute with a shared file
            system or an automated configuration tool, you could
            encounter a situation where some files on your compute
            node use the wrong UID or GID. This causes a
            raft of errors, such as an inability to live migrate
            or start virtual machines.</para>
<para>The following procedure runs on <systemitem class="service"
>nova-compute</systemitem> hosts, based on the KVM hypervisor, and could help to
restore the situation:</para>
<procedure>
<title>To recover from a UID/GID mismatch</title>
<step>
                <para>Make sure that you do not choose UID or GID numbers that are already in use
                    by another user or group.</para>
</step>
<step>
<para>Set the nova uid in <filename>/etc/passwd</filename> to the same number in
all hosts (for example, 112).</para>
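                <para>For example, instead of editing <filename>/etc/passwd</filename> by hand,
                    you could run (a sketch, assuming 112 is the UID you chose):</para>
                <screen><prompt>#</prompt> <userinput>usermod -u 112 nova</userinput></screen>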
</step>
<step>
<para>Set the libvirt-qemu uid in
<filename>/etc/passwd</filename> to the
same number in all hosts (for example,
119).</para>
</step>
<step>
<para>Set the nova group in
<filename>/etc/group</filename> file to
the same number in all hosts (for example,
120).</para>
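                <para>For example (a sketch, assuming 120 is the GID you chose):</para>
                <screen><prompt>#</prompt> <userinput>groupmod -g 120 nova</userinput></screen>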
</step>
<step>
<para>Set the libvirtd group in
<filename>/etc/group</filename> file to
the same number in all hosts (for example,
119).</para>
</step>
<step>
<para>Stop the services on the compute
node.</para>
</step>
<step>
<para>Change all the files owned by user nova or
by group nova. For example:</para>
<programlisting language="bash">find / -uid 108 -exec chown nova {} \; # note the 108 here is the old nova uid before the change
find / -gid 120 -exec chgrp nova {} \;</programlisting>
</step>
<step>
                <para>Repeat these steps for the files owned by the libvirt-qemu user and the
                    libvirtd group, if those IDs also changed.</para>
</step>
<step>
<para>Restart the services.</para>
</step>
<step>
                <para>Run the <command>find</command>
                    command again to verify that all files now use
                    the correct identifiers.</para>
</step>
</procedure>
</section>
<section xml:id="section_nova-disaster-recovery-process">
<title>Compute disaster recovery process</title>
<para>Use the following procedures to manage your cloud after a disaster, and to easily
back up its persistent storage volumes. Backups <emphasis role="bold">are</emphasis>
mandatory, even outside of disaster scenarios.</para>
<para>For a DRP definition, see <link
xlink:href="http://en.wikipedia.org/wiki/Disaster_Recovery_Plan"
>http://en.wikipedia.org/wiki/Disaster_Recovery_Plan</link>.</para>
<simplesect>
            <title>A - Overview of the disaster recovery
                process</title>
<para>A disaster could happen to several components of
your architecture: a disk crash, a network loss, a
power cut, and so on. In this example, assume the
following set up:</para>
<orderedlist>
<listitem>
                    <para>A cloud controller (<systemitem>nova-api</systemitem>,
                        <systemitem>nova-objectstore</systemitem>,
                        <systemitem>nova-network</systemitem>)</para>
</listitem>
<listitem>
<para>A compute node (<systemitem
class="service"
>nova-compute</systemitem>)</para>
</listitem>
                <listitem>
                    <para>A storage area network (SAN) used by
                        <systemitem class="service"
                            >cinder-volume</systemitem></para>
                </listitem>
</orderedlist>
            <para>The example disaster is the worst one: a power
                loss that affects all three
                components. <emphasis role="italic">Here is what
                    runs, and how it runs, before the
                    crash</emphasis>:</para>
<itemizedlist>
                <listitem>
                    <para>From the SAN to the cloud controller, there is an active iSCSI
                        session (used for the "cinder-volumes" LVM volume group).</para>
                </listitem>
                <listitem>
                    <para>From the cloud controller to the compute node, there are also active
                        iSCSI sessions (managed by <systemitem class="service"
                            >cinder-volume</systemitem>).</para>
                </listitem>
                <listitem>
                    <para>For every volume, an iSCSI session is made (so 14 EBS volumes equals
                        14 sessions).</para>
                </listitem>
                <listitem>
                    <para>From the cloud controller to the compute node, there are also iptables
                        and ebtables rules, which allow access from the cloud controller to the
                        running instance.</para>
                </listitem>
                <listitem>
                    <para>Finally, the database stores the current state of the instances (in
                        this case, "running") and their volume attachments (mount point, volume
                        ID, volume status, and so on).</para>
                </listitem>
</itemizedlist>
<para>Now, after the power loss occurs and all
hardware components restart, the situation is as
follows:</para>
<itemizedlist>
<listitem>
                    <para>From the SAN to the cloud controller, the iSCSI
                        session no longer exists.</para>
</listitem>
<listitem>
                    <para>From the cloud controller to the compute
                        node, the iSCSI sessions no longer exist.
                    </para>
</listitem>
<listitem>
<para>From the cloud controller to the compute node, the iptables and
ebtables are recreated, since, at boot,
<systemitem>nova-network</systemitem> reapplies the
configurations.</para>
</listitem>
<listitem>
                    <para>From the cloud controller, instances are in a shutdown state (because
                        they are no longer running).</para>
</listitem>
<listitem>
<para>In the database, data was not updated at all, since Compute could not
have anticipated the crash.</para>
</listitem>
</itemizedlist>
            <para>Before going further, and to prevent the administrator from making fatal
                mistakes, note that <emphasis role="bold">the instances are not lost</emphasis>:
                because no "<command role="italic">destroy</command>" or "<command role="italic"
                    >terminate</command>" command was invoked, the files for the instances remain
                on the compute node.</para>
<para>Perform these tasks in this exact order. <emphasis role="underline">Any extra
step would be dangerous at this stage</emphasis> :</para>
<para>
<orderedlist>
<listitem>
<para>Get the current relation from a
volume to its instance, so that you
can recreate the attachment.</para>
</listitem>
<listitem>
<para>Update the database to clean the
stalled state. (After that, you cannot
perform the first step).</para>
</listitem>
<listitem>
<para>Restart the instances. In other
words, go from a shutdown to running
state.</para>
</listitem>
<listitem>
<para>After the restart, reattach the volumes to their respective
instances (optional).</para>
</listitem>
<listitem>
<para>SSH into the instances to reboot them.</para>
</listitem>
</orderedlist>
</para>
</simplesect>
<simplesect>
<title>B - Disaster recovery</title>
<procedure>
<title>To perform disaster recovery</title>
<step>
<title>Get the instance-to-volume
relationship</title>
                    <para>You must get the current relationship from each volume to its instance,
                        so that you can re-create the attachment.</para>
<para>You can find this relationship by running <command>nova
volume-list</command>. Note that the <command>nova</command> client
includes the ability to get volume information from Block Storage.</para>
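                    <para>A minimal sketch (the file name and format are arbitrary choices): list
                        the volumes, and for each attached volume record the volume ID, the
                        instance ID, and the mount point, one triple per line, in a temporary
                        file. The reattachment script shown later in this procedure reads that
                        file.</para>
                    <screen><prompt>$</prompt> <userinput>nova volume-list</userinput></screen>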
</step>
<step>
<title>Update the database</title>
                    <para>Update the database to clean the stalled state. Run these queries for
                        every volume to clean up the database:</para>
<screen><prompt>mysql></prompt> <userinput>use cinder;</userinput>
<prompt>mysql></prompt> <userinput>update volumes set mountpoint=NULL;</userinput>
<prompt>mysql></prompt> <userinput>update volumes set status="available" where status &lt;&gt;"error_deleting";</userinput>
<prompt>mysql></prompt> <userinput>update volumes set attach_status="detached";</userinput>
<prompt>mysql></prompt> <userinput>update volumes set instance_id=0;</userinput></screen>
                    <para>Now, when you run the <command>nova volume-list</command> command, all
                        volumes appear in the listing.</para>
</step>
<step>
<title>Restart instances</title>
<para>Restart the instances using the <command>nova reboot
<replaceable>$instance</replaceable></command> command.</para>
<para>At this stage, depending on your image, some instances completely
reboot and become reachable, while others stop on the "plymouth"
stage.</para>
</step>
<step>
<title>DO NOT reboot a second time</title>
<para>Do not reboot instances that are stopped at this point. Instance state
depends on whether you added an <filename>/etc/fstab</filename> entry for
that volume. Images built with the <package>cloud-init</package> package
remain in a pending state, while others skip the missing volume and start.
The idea of that stage is only to ask nova to reboot every instance, so the
stored state is preserved. For more information about
<package>cloud-init</package>, see <link
xlink:href="https://help.ubuntu.com/community/CloudInit"
>help.ubuntu.com/community/CloudInit</link>.</para>
</step>
<step>
<title>Reattach volumes</title>
                    <para>After the restart, you can reattach the volumes to their respective
                        instances. Now that <command>nova</command> has restored the right status,
                        it is time to perform the attachments with the <command>nova
                            volume-attach</command> command.</para>
                    <para>This simple snippet uses the file created in the
                        first step:</para>
                    <programlisting language="bash">#!/bin/bash
# $volumes_tmp_file contains one line per volume:
# &lt;volume ID&gt; &lt;instance ID&gt; &lt;mount point&gt;
while read line; do
    volume=`echo $line | cut -f 1 -d " "`
    instance=`echo $line | cut -f 2 -d " "`
    mount_point=`echo $line | cut -f 3 -d " "`
    echo "ATTACHING VOLUME FOR INSTANCE - $instance"
    nova volume-attach $instance $volume $mount_point
    sleep 2
done &lt; $volumes_tmp_file</programlisting>
                    <para>At that stage, instances that were
                        pending on the boot sequence (<emphasis
                            role="italic">plymouth</emphasis>)
                        automatically continue their boot and
                        restart normally, while the instances that had
                        already booted can now see the volume.</para>
</step>
<step>
<title>SSH into instances</title>
                    <para>If some services depend on the volume, or if a volume has an entry
                        in <systemitem>fstab</systemitem>, you should now simply restart the
                        instance. This restart must be performed from within the instance itself,
                        not through <command>nova</command>. SSH into the instance and perform a
                        reboot:</para>
<screen><prompt>#</prompt> <userinput>shutdown -r now</userinput></screen>
</step>
</procedure>
<para>By completing this procedure, you can
successfully recover your cloud.</para>
<note>
<para>Follow these guidelines:</para>
<itemizedlist>
<listitem>
                        <para>Use the <parameter>errors=remount-ro</parameter> option in the
                            <filename>fstab</filename> file, which prevents data
                            corruption.</para>
<para>The system locks any write to the disk if it detects an I/O error.
This configuration option should be added into the <systemitem
class="service">cinder-volume</systemitem> server (the one which
performs the ISCSI connection to the SAN), but also into the instances'
<filename>fstab</filename> file.</para>
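                        <para>For example, a hedged sketch of an <filename>fstab</filename> entry
                            for an attached volume (the device name and mount point are
                            placeholders):</para>
                        <programlisting language="bash">/dev/vdb  /mnt/data  ext4  defaults,errors=remount-ro  0  2</programlisting>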
</listitem>
<listitem>
<para>Do not add the entry for the SAN's disks to the <systemitem
class="service">cinder-volume</systemitem>'s
<filename>fstab</filename> file.</para>
                        <para>Some systems hang on that step, which means you could lose access to
                            your cloud controller. To re-run the session manually, run these
                            commands before performing the mount:
                            <screen><prompt>#</prompt> <userinput>iscsiadm -m discovery -t st -p $SAN_IP</userinput>
<prompt>#</prompt> <userinput>iscsiadm -m node --targetname $IQN -p $SAN_IP -l</userinput></screen></para>
</listitem>
<listitem>
                        <para>For your instances, if the entire <filename>/home/</filename>
                            directory is on the attached disk, do not empty the
                            <filename>/home</filename> directory and mount the disk over it;
                            instead, leave a user's directory in place with the user's bash files
                            and the <filename>authorized_keys</filename> file.</para>
                        <para>This enables you to connect to the instance even without the volume
                            attached, if you allow only connections through public keys.</para>
</listitem>
</itemizedlist>
</note>
</simplesect>
<simplesect>
<title>C - Scripted DRP</title>
<procedure>
<title>To use scripted DRP</title>
                <para>You can download a bash script that performs
                    these steps from <link
                        xlink:href="https://github.com/Razique/BashStuff/blob/master/SYSTEMS/OpenStack/SCR_5006_V00_NUAC-OPENSTACK-DRP-OpenStack.sh"
                        >here</link>:</para>
<step>
<para>The "test mode" allows you to perform
that whole sequence for only one
instance.</para>
</step>
<step>
                    <para>To reproduce the power loss, connect to
                        the compute node which runs that same
                        instance and close the iSCSI session.
                        <emphasis role="underline">Do not
                            detach the volume through
                            <command>nova
                                volume-detach</command></emphasis>;
                        instead, manually close the iSCSI
                        session.</para>
</step>
<step>
                    <para>In this example, the iSCSI session is
                        number 15 for that instance:</para>
<screen><prompt>#</prompt> <userinput>iscsiadm -m session -u -r 15</userinput></screen>
</step>
<step>
<para>Do not forget the <literal>-r</literal>
flag. Otherwise, you close ALL
sessions.</para>
</step>
</procedure>
</simplesect>
</section>
</section>