Copyedit of logging and monitoring chapter

Using the latest O'Reilly PDF, update spelling, grammar and markup
for the ops guide logging and monitoring chapter

Change-Id: Ib2452a47c4e9dc1201256113fa8a9dd1ba350f04
Tom Fifield
2014-03-16 21:01:41 +11:00
committed by Anne Gentle
parent 3505b593ae
commit 3864dadb87

@@ -13,15 +13,16 @@
<?dbhtml stop-chunking?> <?dbhtml stop-chunking?>
<title>Logging and Monitoring</title> <title>Logging and Monitoring</title>
<para>As an OpenStack cloud is composed of so many different <para>As an OpenStack cloud is composed of so many different
services, there are a large number of log files. This section services, there are a large number of log files. This chapter
aims to assist you in locating and working with them, and aims to assist you in locating and working with them and
other ways to track the status of your deployment.</para> describes other ways to track the status of your deployment.</para>
<section xml:id="where_are_logs"> <section xml:id="where_are_logs">
<title>Where Are the Logs?</title> <title>Where Are the Logs?</title>
<para>Most services use the convention of writing <para>Most services use the convention of writing
their log files to subdirectories of the <code>/var/log their log files to subdirectories of the <code>/var/log
directory</code>.</para> directory</code>, as listed in <link linkend="openstack-log-locations">OpenStack Log Locations</link>.</para>
<informaltable rules="all"> <table xml:id="openstack-log-locations" rules="all">
<caption>OpenStack Log Locations</caption>
<thead> <thead>
<tr> <tr>
<th>Node Type</th> <th>Node Type</th>
@@ -31,7 +32,7 @@
</thead> </thead>
<tbody> <tbody>
<tr> <tr>
<td><para>Cloud Controller</para></td> <td><para>Cloud controller</para></td>
<td><para> <td><para>
<code>nova-*</code> <code>nova-*</code>
</para></td> </para></td>
@@ -40,7 +41,7 @@
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Cloud Controller</para></td> <td><para>Cloud controller</para></td>
<td><para> <td><para>
<code>glance-*</code> <code>glance-*</code>
</para></td> </para></td>
@@ -49,7 +50,7 @@
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Cloud Controller</para></td> <td><para>Cloud controller</para></td>
<td><para> <td><para>
<code>cinder-*</code> <code>cinder-*</code>
</para></td> </para></td>
@@ -58,7 +59,7 @@
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Cloud Controller</para></td> <td><para>Cloud controller</para></td>
<td><para> <td><para>
<code>keystone-*</code> <code>keystone-*</code>
</para></td> </para></td>
@@ -67,7 +68,7 @@
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Cloud Controller</para></td> <td><para>Cloud controller</para></td>
<td><para> <td><para>
<code>neutron-*</code> <code>neutron-*</code>
</para></td> </para></td>
@@ -76,7 +77,7 @@
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Cloud Controller</para></td> <td><para>Cloud controller</para></td>
<td><para>horizon</para></td> <td><para>horizon</para></td>
<td><para> <td><para>
<code>/var/log/apache2/</code> <code>/var/log/apache2/</code>
@@ -84,21 +85,21 @@
</tr> </tr>
<tr> <tr>
<td><para>All nodes</para></td> <td><para>All nodes</para></td>
<td><para>misc (Swift, <td><para>misc (swift,
dnsmasq)</para></td> dnsmasq)</para></td>
<td><para> <td><para>
<code>/var/log/syslog</code> <code>/var/log/syslog</code>
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Compute Nodes</para></td> <td><para>Compute nodes</para></td>
<td><para>libvirt</para></td> <td><para>libvirt</para></td>
<td><para> <td><para>
<code>/var/log/libvirt/libvirtd.log</code> <code>/var/log/libvirt/libvirtd.log</code>
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Compute Nodes</para></td> <td><para>Compute nodes</para></td>
<td><para>Console (boot up messages) for VM instances:</para></td> <td><para>Console (boot up messages) for VM instances:</para></td>
<td><para> <td><para>
<code>/var/lib/nova/instances/instance-&lt;instance <code>/var/lib/nova/instances/instance-&lt;instance
@@ -106,36 +107,36 @@
</para></td> </para></td>
</tr> </tr>
<tr> <tr>
<td><para>Block Storage Nodes</para></td> <td><para>Block Storage nodes</para></td>
<td><para>cinder-volume</para></td> <td><para>cinder-volume</para></td>
<td><para> <td><para>
<code>/var/log/cinder/cinder-volume.log</code> <code>/var/log/cinder/cinder-volume.log</code>
</para></td> </para></td>
</tr> </tr>
</tbody> </tbody>
</informaltable> </table>
</section> </section>
<section xml:id="how_to_read_logs"> <section xml:id="how_to_read_logs">
<title>Reading the Logs</title> <title>Reading the Logs</title>
<para>OpenStack services use the standard logging levels, at <para>OpenStack services use the standard logging levels, at
increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR, increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR,
CRITICAL, and TRACE. That is, messages only appear in the logs CRITICAL, and TRACE. That is, messages only appear in the logs
if they are more "severe" than the particular log level if they are more "severe" than the particular log level,
with DEBUG allowing all log statements through. For with DEBUG allowing all log statements through. For
example, TRACE is logged only if the software has a stack example, TRACE is logged only if the software has a stack
trace, while INFO is logged for every message including trace, while INFO is logged for every message including
those that are only for information.</para> those that are only for information.</para>
<para>To disable DEBUG-level logging, edit <para>To disable DEBUG-level logging, edit
<filename>/etc/nova/nova.conf</filename>:</para> <filename>/etc/nova/nova.conf</filename> as follows:</para>
<programlisting language="ini">debug=false</programlisting> <programlisting language="ini">debug=false</programlisting>
<para>Keystone is handled a little differently. To modify the <para>Keystone is handled a little differently. To modify the
logging level, edit the logging level, edit the
<filename>/etc/keystone/logging.conf</filename> file and look <filename>/etc/keystone/logging.conf</filename> file and look
at the <code>logger_root</code> and <code>handler_file</code> at the <code>logger_root</code> and <code>handler_file</code>
sections.</para> sections.</para>
<para>Logging for Horizon is configured in <para>Logging for horizon is configured in
<filename>/etc/openstack_dashboard/local_settings.py</filename>. <filename>/etc/openstack_dashboard/local_settings.py</filename>.
As Horizon is a Django web application, it follows the Because horizon is a Django web application, it follows the
<link xlink:title="Django Logging" <link xlink:title="Django Logging"
xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/" xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/"
>Django Logging</link> >Django Logging</link>
@@ -144,7 +145,7 @@
<para>The first step in finding the source of an error is <para>The first step in finding the source of an error is
typically to search for a CRITICAL, TRACE, or ERROR typically to search for a CRITICAL, TRACE, or ERROR
message in the log starting at the bottom of the log file.</para> message in the log starting at the bottom of the log file.</para>
<para>An example of a CRITICAL log message, with the <para>Here is an example of a CRITICAL log message, with the
corresponding TRACE (Python traceback) immediately corresponding TRACE (Python traceback) immediately
following:</para> following:</para>
<screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group <screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
@@ -179,10 +180,10 @@
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen> 2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>
<para>In this example, cinder-volumes failed to start and has <para>In this example, cinder-volumes failed to start and has
provided a stack trace, since its volume back-end has been provided a stack trace, since its volume back-end has been
unable to setup the storage volume - probably because the unable to set up the storage volume&mdash;probably because the
LVM volume that is expected from the configuration does LVM volume that is expected from the configuration does
not exist.</para> not exist.</para>
<para>An example error log:</para> <para>Here is an example error log:</para>
<screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: <screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
[Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen> [Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>
<para>In this error, a nova service has failed to connect to <para>In this error, a nova service has failed to connect to
@@ -209,10 +210,10 @@
<code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If <code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If
you search for this string on the cloud controller in the you search for this string on the cloud controller in the
<filename>/var/log/nova-*.log</filename> files, it appears in <filename>/var/log/nova-*.log</filename> files, it appears in
<filename>nova-api.log</filename>, and <filename>nova-api.log</filename> and
<filename>nova-scheduler.log</filename>. If you search for <filename>nova-scheduler.log</filename>. If you search for
this on the compute nodes in this on the compute nodes in
<filename>/var/log/nova-*.log</filename>, it appears <filename>/var/log/nova-*.log</filename>, it appears in
<filename>nova-network.log</filename> and <filename>nova-network.log</filename> and
<filename>nova-compute.log</filename>. If no ERROR or CRITICAL <filename>nova-compute.log</filename>. If no ERROR or CRITICAL
messages appear, the most recent log entry that reports messages appear, the most recent log entry that reports
@@ -233,11 +234,11 @@
LOG = logging.getLogger(__name__)</programlisting> LOG = logging.getLogger(__name__)</programlisting>
<para>To add a DEBUG logging statement, you would do:</para> <para>To add a DEBUG logging statement, you would do:</para>
<programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting> <programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>
<para>You may notice that all of the existing logging messages <para>You may notice that all the existing logging messages
are preceded by an underscore and surrounded by are preceded by an underscore and surrounded by
parentheses, for example:</para> parentheses, for example:</para>
<programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting> <programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>
<para>This is used to support translation of logging messages <para>This formatting is used to support translation of logging messages
into different languages using the <link into different languages using the <link
xlink:href="http://docs.python.org/2/library/gettext.html" xlink:href="http://docs.python.org/2/library/gettext.html"
>gettext</link> >gettext</link>
@@ -256,9 +257,7 @@ LOG = logging.getLogger(__name__)</programlisting>
issues. Instead, we recommend you use the RabbitMQ web issues. Instead, we recommend you use the RabbitMQ web
management interface. Enable it on your cloud management interface. Enable it on your cloud
controller:</para> controller:</para>
<screen><prompt>#</prompt> <screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>
<userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable
rabbitmq_management</userinput></screen>
<screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen> <screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>
<para>The RabbitMQ web management interface is accessible on <para>The RabbitMQ web management interface is accessible on
your cloud controller at http://localhost:55672.</para> your cloud controller at http://localhost:55672.</para>
@@ -271,11 +270,11 @@ LOG = logging.getLogger(__name__)</programlisting>
<screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:" <screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"
Version: 2.7.1-0ubuntu4</userinput></screen> Version: 2.7.1-0ubuntu4</userinput></screen>
</note> </note>
<para>An alternative to enabling the RabbitMQ Web Management <para>An alternative to enabling the RabbitMQ web management
Interface is to use the <command>rabbitmqctl</command> commands. For example, interface is to use the <command>rabbitmqctl</command> commands. For example,
<command>rabbitmqctl list_queues| grep <command>rabbitmqctl list_queues| grep
cinder</command> displays any messages cinder</command> displays any messages
left in the queue. If there are, it's a possible sign that left in the queue. If any messages are there, it's a possible sign that
cinder services didn't connect properly to rabbitmq and cinder services didn't connect properly to rabbitmq and
might have to be restarted.</para> might have to be restarted.</para>
<para>Items to monitor for RabbitMQ include the number of <para>Items to monitor for RabbitMQ include the number of
@@ -287,14 +286,14 @@ Version: 2.7.1-0ubuntu4</userinput></screen>
<para>Because your cloud is most likely composed of many <para>Because your cloud is most likely composed of many
servers, you must check logs on each of those servers to servers, you must check logs on each of those servers to
properly piece an event together. A better solution is to properly piece an event together. A better solution is to
send the logs of all servers to a central location so they send the logs of all servers to a central location so that they
can all be accessed from the same area.</para> can all be accessed from the same area.</para>
<para>Ubuntu uses rsyslog as the default logging service. <para>Ubuntu uses rsyslog as the default logging service.
Since it is natively able to send logs to a remote Since it is natively able to send logs to a remote
location, you don't have to install anything extra to location, you don't have to install anything extra to
enable this feature, just modify the configuration file. enable this feature, just modify the configuration file.
In doing this, consider running your logging over a In doing this, consider running your logging over a
management network, or using an encrypted VPN to avoid management network or using an encrypted VPN to avoid
interception.</para> interception.</para>
<section xml:id="rsyslog_client_config"> <section xml:id="rsyslog_client_config">
<title>rsyslog Client Configuration</title> <title>rsyslog Client Configuration</title>
@@ -327,8 +326,8 @@ syslog_log_facility=LOG_LOCAL3</programlisting>
following line:</para> following line:</para>
<programlisting language="ini">*.* @192.168.1.10</programlisting> <programlisting language="ini">*.* @192.168.1.10</programlisting>
<para>This instructs rsyslog to send all logs to the IP <para>This instructs rsyslog to send all logs to the IP
listed. In this example, the IP points to the Cloud listed. In this example, the IP points to the cloud
Controller.</para> controller.</para>
</section> </section>
<section xml:id="rsyslog_server_config"> <section xml:id="rsyslog_server_config">
<title>rsyslog Server Configuration</title> <title>rsyslog Server Configuration</title>
@@ -360,7 +359,7 @@ $template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
local0.* ?NovaFile local0.* ?NovaFile
local0.* ?NovaAll local0.* ?NovaAll
&amp; ~</programlisting> &amp; ~</programlisting>
<para>The above example configuration handles the nova service only. <para>This example configuration handles the nova service only.
It first configures rsyslog to act as a server that runs on port It first configures rsyslog to act as a server that runs on port
514. Next, it creates a series of logging templates. Logging 514. Next, it creates a series of logging templates. Logging
templates control where received logs are stored. Using templates control where received logs are stored. Using
@@ -378,7 +377,7 @@ local0.* ?NovaAll
</para> </para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
<para>This is useful as logs from c02.example.com go to:</para> <para>This is useful, as logs from c02.example.com go to:</para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<para> <para>
@@ -397,10 +396,12 @@ local0.* ?NovaAll
</section> </section>
</section> </section>
<section xml:id="stacktach"> <section xml:id="stacktach">
<!-- FIXME This section needs updating, especially with the advent of
ceilometer -->
<title>StackTach</title> <title>StackTach</title>
<para>StackTach is a tool created by Rackspace to collect and <para>StackTach is a tool created by Rackspace to collect and
report the notifications sent by <code>nova</code>. report the notifications sent by <code>nova</code>.
Notifications are essentially the same as logs, but can be Notifications are essentially the same as logs but can be
much more detailed. A good overview of notifications can much more detailed. A good overview of notifications can
be found at <link xlink:title="StackTach GitHub repo" be found at <link xlink:title="StackTach GitHub repo"
xlink:href="https://wiki.openstack.org/wiki/SystemUsageData" xlink:href="https://wiki.openstack.org/wiki/SystemUsageData"
@@ -433,7 +434,7 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
capable of executing arbitrary commands to check the capable of executing arbitrary commands to check the
status of server and network services, remotely status of server and network services, remotely
executing arbitrary commands directly on servers, and executing arbitrary commands directly on servers, and
allow servers to push notifications back in the form allowing servers to push notifications back in the form
of passive monitoring. Nagios has been around since of passive monitoring. Nagios has been around since
1999. Although newer monitoring services are 1999. Although newer monitoring services are
available, Nagios is a tried-and-true systems available, Nagios is a tried-and-true systems
@@ -442,9 +443,9 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
<section xml:id="process_monitoring"> <section xml:id="process_monitoring">
<title>Process Monitoring</title> <title>Process Monitoring</title>
<para>A basic type of alert monitoring is to simply check <para>A basic type of alert monitoring is to simply check
and see if a required process is running. For example, and see whether a required process is running. For example,
ensure that the <code>nova-api</code> service is ensure that the <code>nova-api</code> service is
running on the Cloud Controller:</para> running on the cloud controller:</para>
<screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput> <screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova <computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
@@ -477,22 +478,22 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
more resources are critically low. While the more resources are critically low. While the
monitoring thresholds should be tuned to your specific monitoring thresholds should be tuned to your specific
OpenStack environment, monitoring resource usage is OpenStack environment, monitoring resource usage is
not specific to OpenStack at all - any generic type of not specific to OpenStack at all&mdash;any generic type of
alert will work fine.</para> alert will work fine.</para>
<para>Some of the resources that you want to monitor <para>Some of the resources that you want to monitor
include:</para> include:</para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<para>Disk Usage</para> <para>Disk usage</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Server Load</para> <para>Server load</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Memory Usage</para> <para>Memory usage</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Network IO</para> <para>Network I/O</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Available vCPUs</para> <para>Available vCPUs</para>
@@ -512,8 +513,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
configuration:</para> configuration:</para>
<programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting> <programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting>
<para>Nagios alerts you with a WARNING when any disk on <para>Nagios alerts you with a WARNING when any disk on
the compute node is 80% full and CRITICAL when 90% is the compute node is 80 percent full and CRITICAL when 90
full.</para> percent is full.</para>
</section> </section>
<section xml:id="metering_telemetry"> <section xml:id="metering_telemetry">
<title>Metering and Telemetry with Ceilometer</title> <title>Metering and Telemetry with Ceilometer</title>
@@ -530,13 +531,13 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
xlink:href="http://docs.openstack.org/developer/ceilometer/" xlink:href="http://docs.openstack.org/developer/ceilometer/"
>http://docs.openstack.org/developer/ceilometer/</link>.</para></section> >http://docs.openstack.org/developer/ceilometer/</link>.</para></section>
<section xml:id="os_resources"> <section xml:id="os_resources">
<title>OpenStack-specific Resources</title> <title>OpenStack-Specific Resources</title>
<para>Resources such as memory, disk, and CPU are generic <para>Resources such as memory, disk, and CPU are generic
resources that all servers (even non-OpenStack resources that all servers (even non-OpenStack
servers) have and are important to the overall health servers) have and are important to the overall health
of the server. When dealing with OpenStack of the server. When dealing with OpenStack
specifically, these resources are important for a specifically, these resources are important for a
second reason: ensuring enough are available in order second reason: ensuring that enough are available
to launch instances. There are a few ways you can see to launch instances. There are a few ways you can see
OpenStack resource usage.</para> OpenStack resource usage.</para>
<para>The first is through the <code>nova</code> <para>The first is through the <code>nova</code>
@@ -545,14 +546,14 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
<para>This command displays a list of how many instances a <para>This command displays a list of how many instances a
tenant has running and some light usage statistics tenant has running and some light usage statistics
about the combined instances. This command is useful about the combined instances. This command is useful
for a quick overview of your cloud, but doesn't really for a quick overview of your cloud, but it doesn't really
get into a lot of details.</para> get into a lot of details.</para>
<para>Next, the <code>nova</code> database contains three <para>Next, the <code>nova</code> database contains three
tables that store usage information.</para> tables that store usage information.</para>
<para>The <code>nova.quotas</code> and <para>The <code>nova.quotas</code> and
<code>nova.quota_usages</code> tables store quota <code>nova.quota_usages</code> tables store quota
information. If a tenant's quota is different than the information. If a tenant's quota is different from the
default quota settings, their quota is stored in default quota settings, its quota is stored in the
<code>nova.quotas</code> table. For <code>nova.quotas</code> table. For
example:</para> example:</para>
<screen><prompt>mysql&gt;</prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput> <screen><prompt>mysql&gt;</prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
@@ -587,12 +588,12 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
<para>By comparing a tenant's hard limit with their <para>By comparing a tenant's hard limit with their
current resource usage, you can see their usage current resource usage, you can see their usage
percentage. For example, if this tenant is using 1 percentage. For example, if this tenant is using 1
Floating IP out of 10, then they are using 10% of floating IP out of 10, then they are using 10 percent of
their Floating IP quota. Rather than doing the their floating IP quota. Rather than doing the
calculation manually, you can use SQL or the scripting calculation manually, you can use SQL or the scripting
language of your choice and create a formatted language of your choice and create a formatted
report:</para> report:</para>
<screen><computeroutput>+----------------------------------+------------+-------------+---------------+ <screen><computeroutput>+----------------------------------+------------+-------------+---------------+
| some_tenant | | some_tenant |
+-----------------------------------+------------+------------+---------------+ +-----------------------------------+------------+------------+---------------+
| Resource | Used | Limit | | | Resource | Used | Limit | |
@@ -613,8 +614,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
| security_groups | 0 | 10 | 0 % | | security_groups | 0 | 10 | 0 % |
| volumes | 2 | 10 | 20 % | | volumes | 2 | 10 | 20 % |
+-----------------------------------+------------+------------+---------------+</computeroutput></screen> +-----------------------------------+------------+------------+---------------+</computeroutput></screen>
<para>The above was generated using a custom script which <para>The above information was generated by using a custom script
can be found on GitHub that can be found on GitHub
(https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para> (https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para>
<note> <note>
<para>This script is specific to a certain OpenStack <para>This script is specific to a certain OpenStack
@@ -627,15 +628,15 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
<title>Intelligent Alerting</title> <title>Intelligent Alerting</title>
<para>Intelligent alerting can be thought of as a form of <para>Intelligent alerting can be thought of as a form of
continuous integration for operations. For example, continuous integration for operations. For example,
you can easily check to see if the Image Service is up and you can easily check to see whether the Image Service is up and
running by ensuring that the <code>glance-api</code> running by ensuring that the <code>glance-api</code>
and <code>glance-registry</code> processes are running and <code>glance-registry</code> processes are running
or by seeing if <code>glance-api</code> is responding or by seeing whether <code>glance-api</code> is responding
on port 9292.</para> on port 9292.</para>
<para>But how can you tell if images are being <para>But how can you tell whether images are being
successfully uploaded to the Image Service? Maybe the successfully uploaded to the Image Service? Maybe the
disk that Image Service is storing the images on is disk that Image Service is storing the images on is
full or the S3 back-end is down. You could naturally full or the S3 backend is down. You could naturally
check this by doing a quick image upload:</para> check this by doing a quick image upload:</para>
<programlisting language="bash">#!/bin/bash <programlisting language="bash">#!/bin/bash
# #
@@ -649,35 +650,35 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
6_64-disk.img</programlisting> 6_64-disk.img</programlisting>
<para>By taking this script and rolling it into an alert <para>By taking this script and rolling it into an alert
for your monitoring system (such as Nagios), you now for your monitoring system (such as Nagios), you now
have an automated way of ensuring image uploads to the have an automated way of ensuring that image uploads to the
Image Catalog are working.</para> Image Catalog are working.</para>
<note> <note>
<para>You must remove the image after each test. Even <para>You must remove the image after each test. Even
better, test whether you can successfully delete better, test whether you can successfully delete
an image from the Image Service.</para> an image from the Image Service.</para>
</note> </note>
<para>Intelligent alerting takes a considerable more <para>Intelligent alerting takes considerably more
amount of time to plan and implement than the other time to plan and implement than the other
alerts described in this chapter. A good outline to alerts described in this chapter. A good outline to
implement intelligent alerting is:</para> implement intelligent alerting is:</para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<para>Review common actions in your cloud</para> <para>Review common actions in your cloud.</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Create ways to automatically test these <para>Create ways to automatically test these
actions</para> actions.</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Roll these tests into an alerting <para>Roll these tests into an alerting
system</para> system.</para>
</listitem> </listitem>
</itemizedlist> </itemizedlist>
<para>Some other examples for Intelligent Alerting <para>Some other examples for Intelligent Alerting
include:</para> include:</para>
<itemizedlist> <itemizedlist>
<listitem> <listitem>
<para>Can instances launch and destroyed?</para> <para>Can instances launch and be destroyed?</para>
</listitem> </listitem>
<listitem> <listitem>
<para>Can users be created?</para> <para>Can users be created?</para>
@@ -693,7 +694,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
<section xml:id="trending"> <section xml:id="trending">
<title>Trending</title> <title>Trending</title>
<para>Trending can give you great insight into how your <para>Trending can give you great insight into how your
cloud is performing day to day. For example, if a busy cloud is performing day to day. You can learn, for example, if a busy
day was simply a rare occurrence or if you should day was simply a rare occurrence or if you should
start adding new compute nodes.</para> start adding new compute nodes.</para>
<para>Trending takes a slightly different approach than <para>Trending takes a slightly different approach than
@@ -733,7 +734,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
<para>As an example, recording <code>nova-api</code> usage <para>As an example, recording <code>nova-api</code> usage
can allow you to track the need to scale your cloud can allow you to track the need to scale your cloud
controller. By keeping an eye on <code>nova-api</code> controller. By keeping an eye on <code>nova-api</code>
requests, you can determine if you need to spawn more requests, you can determine whether you need to spawn more
nova-api processes or go as far as introducing an nova-api processes or go as far as introducing an
entirely new server to run <code>nova-api</code>. To entirely new server to run <code>nova-api</code>. To
get an approximate count of the requests, look for get an approximate count of the requests, look for
@@ -762,10 +763,10 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
<title>Summary</title> <title>Summary</title>
<para>For stable operations, you want to detect failure promptly and <para>For stable operations, you want to detect failure promptly and
determine causes efficiently. With a distributed system, it's even determine causes efficiently. With a distributed system, it's even
more important to track the right items to meet a service level target. more important to track the right items to meet a service-level target.
Learning where these logs are located in the file system or API gives Learning where these logs are located in the file system or API gives
you an advantage. Plus, we have discussed how to read, interpret, and you an advantage. This chapter also showed how to read, interpret, and
manipulate information from OpenStack services so you can monitor manipulate information from OpenStack services so that you can monitor
effectively.</para> effectively.</para>
</section> </section>
</chapter> </chapter>
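
The chapter's advice to search for a CRITICAL, TRACE, or ERROR message starting at the bottom of the log file could be sketched roughly as follows. This is an illustration only and not part of the commit: it assumes the field layout seen in the sample entries (date, time, PID, level, then module and message), and the function name `last_severe_entry` is hypothetical.

```python
# Levels the chapter suggests searching for first.
SEVERE_LEVELS = {"CRITICAL", "TRACE", "ERROR"}


def last_severe_entry(lines):
    """Return the most recent CRITICAL, TRACE, or ERROR line, or None.

    Assumes each entry starts with: date, time, PID, level (as in the
    chapter's sample output). Lines that do not fit are skipped.
    """
    # Walk from the bottom of the log upward, as the chapter recommends.
    for line in reversed(list(lines)):
        fields = line.split(None, 4)
        if len(fields) >= 4 and fields[3] in SEVERE_LEVELS:
            return line.rstrip("\n")
    return None
```

An open file object can be passed directly, for example `with open("/var/log/cinder/cinder-volume.log") as f: print(last_severe_entry(f))`. The sketch reads the whole file into memory, so for very large logs a tool such as `tail` or `grep` is probably the better fit.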