Copyedit of logging and monitoring chapter
Using the latest O'Reilly PDF, update spelling, grammar, and markup for the ops guide logging and monitoring chapter. Change-Id: Ib2452a47c4e9dc1201256113fa8a9dd1ba350f04
@@ -13,15 +13,16 @@
|
|||||||
<?dbhtml stop-chunking?>
|
<?dbhtml stop-chunking?>
|
||||||
<title>Logging and Monitoring</title>
|
<title>Logging and Monitoring</title>
|
||||||
<para>As an OpenStack cloud is composed of so many different
|
<para>As an OpenStack cloud is composed of so many different
|
||||||
services, there are a large number of log files. This section
|
services, there are a large number of log files. This chapter
|
||||||
aims to assist you in locating and working with them, and
|
aims to assist you in locating and working with them and
|
||||||
other ways to track the status of your deployment.</para>
|
describes other ways to track the status of your deployment.</para>
|
||||||
<section xml:id="where_are_logs">
|
<section xml:id="where_are_logs">
|
||||||
<title>Where Are the Logs?</title>
|
<title>Where Are the Logs?</title>
|
||||||
<para>Most services use the convention of writing
|
<para>Most services use the convention of writing
|
||||||
their log files to subdirectories of the <code>/var/log
|
their log files to subdirectories of the <code>/var/log
|
||||||
directory</code>.</para>
|
directory</code>, as listed in <link linkend="openstack-log-locations">OpenStack Log Locations</link>.</para>
|
||||||
<informaltable rules="all">
|
<table xml:id="openstack-log-locations" rules="all">
|
||||||
|
<caption>OpenStack Log Locations</caption>
|
||||||
<thead>
|
<thead>
|
||||||
<tr>
|
<tr>
|
||||||
<th>Node Type</th>
|
<th>Node Type</th>
|
||||||
@@ -31,7 +32,7 @@
|
|||||||
</thead>
|
</thead>
|
||||||
<tbody>
|
<tbody>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Cloud Controller</para></td>
|
<td><para>Cloud controller</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>nova-*</code>
|
<code>nova-*</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
@@ -40,7 +41,7 @@
|
|||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Cloud Controller</para></td>
|
<td><para>Cloud controller</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>glance-*</code>
|
<code>glance-*</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
@@ -49,7 +50,7 @@
|
|||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Cloud Controller</para></td>
|
<td><para>Cloud controller</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>cinder-*</code>
|
<code>cinder-*</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
@@ -58,7 +59,7 @@
|
|||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Cloud Controller</para></td>
|
<td><para>Cloud controller</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>keystone-*</code>
|
<code>keystone-*</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
@@ -67,7 +68,7 @@
|
|||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Cloud Controller</para></td>
|
<td><para>Cloud controller</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>neutron-*</code>
|
<code>neutron-*</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
@@ -76,7 +77,7 @@
|
|||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Cloud Controller</para></td>
|
<td><para>Cloud controller</para></td>
|
||||||
<td><para>horizon</para></td>
|
<td><para>horizon</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>/var/log/apache2/</code>
|
<code>/var/log/apache2/</code>
|
||||||
@@ -84,21 +85,21 @@
|
|||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>All nodes</para></td>
|
<td><para>All nodes</para></td>
|
||||||
<td><para>misc (Swift,
|
<td><para>misc (swift,
|
||||||
dnsmasq)</para></td>
|
dnsmasq)</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>/var/log/syslog</code>
|
<code>/var/log/syslog</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Compute Nodes</para></td>
|
<td><para>Compute nodes</para></td>
|
||||||
<td><para>libvirt</para></td>
|
<td><para>libvirt</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>/var/log/libvirt/libvirtd.log</code>
|
<code>/var/log/libvirt/libvirtd.log</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Compute Nodes</para></td>
|
<td><para>Compute nodes</para></td>
|
||||||
<td><para>Console (boot up messages) for VM instances:</para></td>
|
<td><para>Console (boot up messages) for VM instances:</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>/var/lib/nova/instances/instance-<instance
|
<code>/var/lib/nova/instances/instance-<instance
|
||||||
@@ -106,36 +107,36 @@
|
|||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td><para>Block Storage Nodes</para></td>
|
<td><para>Block Storage nodes</para></td>
|
||||||
<td><para>cinder-volume</para></td>
|
<td><para>cinder-volume</para></td>
|
||||||
<td><para>
|
<td><para>
|
||||||
<code>/var/log/cinder/cinder-volume.log</code>
|
<code>/var/log/cinder/cinder-volume.log</code>
|
||||||
</para></td>
|
</para></td>
|
||||||
</tr>
|
</tr>
|
||||||
</tbody>
|
</tbody>
|
||||||
</informaltable>
|
</table>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="how_to_read_logs">
|
<section xml:id="how_to_read_logs">
|
||||||
<title>Reading the Logs</title>
|
<title>Reading the Logs</title>
|
||||||
<para>OpenStack services use the standard logging levels, at
|
<para>OpenStack services use the standard logging levels, at
|
||||||
increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR,
|
increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR,
|
||||||
CRITICAL, and TRACE. That is, messages only appear in the logs
|
CRITICAL, and TRACE. That is, messages only appear in the logs
|
||||||
if they are more "severe" than the particular log level
|
if they are more "severe" than the particular log level,
|
||||||
with DEBUG allowing all log statements through. For
|
with DEBUG allowing all log statements through. For
|
||||||
example, TRACE is logged only if the software has a stack
|
example, TRACE is logged only if the software has a stack
|
||||||
trace, while INFO is logged for every message including
|
trace, while INFO is logged for every message including
|
||||||
those that are only for information.</para>
|
those that are only for information.</para>
|
||||||
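The threshold behavior described in this paragraph can be illustrated with Python's standard `logging` module. This is only a sketch: the standard library defines DEBUG, INFO, WARNING, ERROR, and CRITICAL, so the AUDIT and TRACE levels named in the chapter are left out, and the `emitted_levels` helper and its logger names are invented for the illustration.

```python
import io
import logging

ALL_LEVELS = (logging.DEBUG, logging.INFO, logging.WARNING,
              logging.ERROR, logging.CRITICAL)


def emitted_levels(threshold):
    """Return the level names that get through a logger set to `threshold`."""
    stream = io.StringIO()
    # Hypothetical logger name, unique per threshold so runs do not interfere.
    logger = logging.getLogger("level_demo_%s" % threshold)
    logger.setLevel(threshold)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.handlers = [handler]
    # Emit one message at every level; only those at or above the
    # threshold are written to the stream.
    for level in ALL_LEVELS:
        logger.log(level, "sample message")
    return stream.getvalue().split()


print(emitted_levels(logging.DEBUG))    # everything passes
print(emitted_levels(logging.WARNING))  # DEBUG and INFO are dropped
```

Setting the threshold to DEBUG lets every message through, which is why turning off `debug` in a service's configuration reduces log volume.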
<para>To disable DEBUG-level logging, edit
|
<para>To disable DEBUG-level logging, edit
|
||||||
<filename>/etc/nova/nova.conf</filename>:</para>
|
<filename>/etc/nova/nova.conf</filename> as follows:</para>
|
||||||
<programlisting language="ini">debug=false</programlisting>
|
<programlisting language="ini">debug=false</programlisting>
|
||||||
<para>Keystone is handled a little differently. To modify the
|
<para>Keystone is handled a little differently. To modify the
|
||||||
logging level, edit the
|
logging level, edit the
|
||||||
<filename>/etc/keystone/logging.conf</filename> file and look
|
<filename>/etc/keystone/logging.conf</filename> file and look
|
||||||
at the <code>logger_root</code> and <code>handler_file</code>
|
at the <code>logger_root</code> and <code>handler_file</code>
|
||||||
sections.</para>
|
sections.</para>
|
||||||
<para>Logging for Horizon is configured in
|
<para>Logging for horizon is configured in
|
||||||
<filename>/etc/openstack_dashboard/local_settings.py</filename>.
|
<filename>/etc/openstack_dashboard/local_settings.py</filename>.
|
||||||
As Horizon is a Django web application, it follows the
|
Because horizon is a Django web application, it follows the
|
||||||
<link xlink:title="Django Logging"
|
<link xlink:title="Django Logging"
|
||||||
xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/"
|
xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/"
|
||||||
>Django Logging</link>
|
>Django Logging</link>
|
||||||
@@ -144,7 +145,7 @@
|
|||||||
<para>The first step in finding the source of an error is
|
<para>The first step in finding the source of an error is
|
||||||
typically to search for a CRITICAL, TRACE, or ERROR
|
typically to search for a CRITICAL, TRACE, or ERROR
|
||||||
message in the log starting at the bottom of the log file.</para>
|
message in the log starting at the bottom of the log file.</para>
|
||||||
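A bottom-up search of this kind might be sketched as follows. The function name is hypothetical, and the code assumes the line layout seen in the chapter's sample output (date, time, process ID, then the level); other deployments may format their log lines differently.

```python
SEVERE_LEVELS = ("CRITICAL", "TRACE", "ERROR")


def last_severe_line(lines, levels=SEVERE_LEVELS):
    """Return (line_number, line) for the last severe entry, or None.

    Scans from the bottom of the log, as the chapter suggests, and stops
    at the first line whose level field matches.
    """
    for index in range(len(lines) - 1, -1, -1):
        fields = lines[index].split()
        # Assumed layout: date, time, pid, LEVEL, module, message...
        if len(fields) > 3 and fields[3] in levels:
            return index + 1, lines[index]
    return None


sample = [
    "2013-02-25 20:26:30 6619 INFO nova.service [-] starting",
    "2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] "
    "AMQP server on localhost:5672 is unreachable",
    "2013-02-25 20:26:56 6619 INFO nova.service [-] retrying",
]
print(last_severe_line(sample))
```

Matching on the level field rather than anywhere in the line avoids false hits on messages that merely contain the word "ERROR".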
<para>An example of a CRITICAL log message, with the
|
<para>Here is an example of a CRITICAL log message, with the
|
||||||
corresponding TRACE (Python traceback) immediately
|
corresponding TRACE (Python traceback) immediately
|
||||||
following:</para>
|
following:</para>
|
||||||
<screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
|
<screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
|
||||||
@@ -179,10 +180,10 @@
|
|||||||
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>
|
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>
|
||||||
<para>In this example, cinder-volumes failed to start and has
|
<para>In this example, cinder-volumes failed to start and has
|
||||||
provided a stack trace, since its volume back-end has been
|
provided a stack trace, since its volume back-end has been
|
||||||
unable to setup the storage volume - probably because the
|
unable to set up the storage volume—probably because the
|
||||||
LVM volume that is expected from the configuration does
|
LVM volume that is expected from the configuration does
|
||||||
not exist.</para>
|
not exist.</para>
|
||||||
<para>An example error log:</para>
|
<para>Here is an example error log:</para>
|
||||||
<screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
|
<screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
|
||||||
[Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>
|
[Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>
|
||||||
<para>In this error, a nova service has failed to connect to
|
<para>In this error, a nova service has failed to connect to
|
||||||
@@ -209,10 +210,10 @@
|
|||||||
<code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If
|
<code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If
|
||||||
you search for this string on the cloud controller in the
|
you search for this string on the cloud controller in the
|
||||||
<filename>/var/log/nova-*.log</filename> files, it appears in
|
<filename>/var/log/nova-*.log</filename> files, it appears in
|
||||||
<filename>nova-api.log</filename>, and
|
<filename>nova-api.log</filename> and
|
||||||
<filename>nova-scheduler.log</filename>. If you search for
|
<filename>nova-scheduler.log</filename>. If you search for
|
||||||
this on the compute nodes in
|
this on the compute nodes in
|
||||||
<filename>/var/log/nova-*.log</filename>, it appears
|
<filename>/var/log/nova-*.log</filename>, it appears in
|
||||||
<filename>nova-network.log</filename> and
|
<filename>nova-network.log</filename> and
|
||||||
<filename>nova-compute.log</filename>. If no ERROR or CRITICAL
|
<filename>nova-compute.log</filename>. If no ERROR or CRITICAL
|
||||||
messages appear, the most recent log entry that reports
|
messages appear, the most recent log entry that reports
|
||||||
@@ -233,11 +234,11 @@
|
|||||||
LOG = logging.getLogger(__name__)</programlisting>
|
LOG = logging.getLogger(__name__)</programlisting>
|
||||||
<para>To add a DEBUG logging statement, you would do:</para>
|
<para>To add a DEBUG logging statement, you would do:</para>
|
||||||
<programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>
|
<programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>
|
||||||
<para>You may notice that all of the existing logging messages
|
<para>You may notice that all the existing logging messages
|
||||||
are preceded by an underscore and surrounded by
|
are preceded by an underscore and surrounded by
|
||||||
parentheses, for example:</para>
|
parentheses, for example:</para>
|
||||||
<programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>
|
<programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>
|
||||||
<para>This is used to support translation of logging messages
|
<para>This formatting is used to support translation of logging messages
|
||||||
into different languages using the <link
|
into different languages using the <link
|
||||||
xlink:href="http://docs.python.org/2/library/gettext.html"
|
xlink:href="http://docs.python.org/2/library/gettext.html"
|
||||||
>gettext</link>
|
>gettext</link>
|
||||||
@@ -256,9 +257,7 @@ LOG = logging.getLogger(__name__)</programlisting>
|
|||||||
issues. Instead, we recommend you use the RabbitMQ web
|
issues. Instead, we recommend you use the RabbitMQ web
|
||||||
management interface. Enable it on your cloud
|
management interface. Enable it on your cloud
|
||||||
controller:</para>
|
controller:</para>
|
||||||
<screen><prompt>#</prompt>
|
<screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>
|
||||||
<userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable
|
|
||||||
rabbitmq_management</userinput></screen>
|
|
||||||
<screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>
|
<screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>
|
||||||
<para>The RabbitMQ web management interface is accessible on
|
<para>The RabbitMQ web management interface is accessible on
|
||||||
your cloud controller at http://localhost:55672.</para>
|
your cloud controller at http://localhost:55672.</para>
|
||||||
@@ -271,11 +270,11 @@ LOG = logging.getLogger(__name__)</programlisting>
|
|||||||
<screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"
|
<screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"
|
||||||
Version: 2.7.1-0ubuntu4</userinput></screen>
|
Version: 2.7.1-0ubuntu4</userinput></screen>
|
||||||
</note>
|
</note>
|
||||||
<para>An alternative to enabling the RabbitMQ Web Management
|
<para>An alternative to enabling the RabbitMQ web management
|
||||||
Interface is to use the <command>rabbitmqctl</command> commands. For example,
|
interface is to use the <command>rabbitmqctl</command> commands. For example,
|
||||||
<command>rabbitmqctl list_queues| grep
|
<command>rabbitmqctl list_queues| grep
|
||||||
cinder</command> displays any messages
|
cinder</command> displays any messages
|
||||||
left in the queue. If there are, it's a possible sign that
|
left in the queue. If any messages are there, it's a possible sign that
|
||||||
cinder services didn't connect properly to rabbitmq and
|
cinder services didn't connect properly to rabbitmq and
|
||||||
might have to be restarted.</para>
|
might have to be restarted.</para>
|
||||||
<para>Items to monitor for RabbitMQ include the number of
|
<para>Items to monitor for RabbitMQ include the number of
|
||||||
@@ -287,14 +286,14 @@ Version: 2.7.1-0ubuntu4</userinput></screen>
|
|||||||
<para>Because your cloud is most likely composed of many
|
<para>Because your cloud is most likely composed of many
|
||||||
servers, you must check logs on each of those servers to
|
servers, you must check logs on each of those servers to
|
||||||
properly piece an event together. A better solution is to
|
properly piece an event together. A better solution is to
|
||||||
send the logs of all servers to a central location so they
|
send the logs of all servers to a central location so that they
|
||||||
can all be accessed from the same area.</para>
|
can all be accessed from the same area.</para>
|
||||||
<para>Ubuntu uses rsyslog as the default logging service.
|
<para>Ubuntu uses rsyslog as the default logging service.
|
||||||
Since it is natively able to send logs to a remote
|
Since it is natively able to send logs to a remote
|
||||||
location, you don't have to install anything extra to
|
location, you don't have to install anything extra to
|
||||||
enable this feature, just modify the configuration file.
|
enable this feature, just modify the configuration file.
|
||||||
In doing this, consider running your logging over a
|
In doing this, consider running your logging over a
|
||||||
management network, or using an encrypted VPN to avoid
|
management network or using an encrypted VPN to avoid
|
||||||
interception.</para>
|
interception.</para>
|
||||||
<section xml:id="rsyslog_client_config">
|
<section xml:id="rsyslog_client_config">
|
||||||
<title>rsyslog Client Configuration</title>
|
<title>rsyslog Client Configuration</title>
|
||||||
@@ -327,8 +326,8 @@ syslog_log_facility=LOG_LOCAL3</programlisting>
|
|||||||
following line:</para>
|
following line:</para>
|
||||||
<programlisting language="ini">*.* @192.168.1.10</programlisting>
|
<programlisting language="ini">*.* @192.168.1.10</programlisting>
|
||||||
<para>This instructs rsyslog to send all logs to the IP
|
<para>This instructs rsyslog to send all logs to the IP
|
||||||
listed. In this example, the IP points to the Cloud
|
listed. In this example, the IP points to the cloud
|
||||||
Controller.</para>
|
controller.</para>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="rsyslog_server_config">
|
<section xml:id="rsyslog_server_config">
|
||||||
<title>rsyslog Server Configuration</title>
|
<title>rsyslog Server Configuration</title>
|
||||||
@@ -360,7 +359,7 @@ $template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
|
|||||||
local0.* ?NovaFile
|
local0.* ?NovaFile
|
||||||
local0.* ?NovaAll
|
local0.* ?NovaAll
|
||||||
& ~</programlisting>
|
& ~</programlisting>
|
||||||
<para>The above example configuration handles the nova service only.
|
<para>This example configuration handles the nova service only.
|
||||||
It first configures rsyslog to act as a server that runs on port
|
It first configures rsyslog to act as a server that runs on port
|
||||||
514. Next, it creates a series of logging templates. Logging
|
514. Next, it creates a series of logging templates. Logging
|
||||||
templates control where received logs are stored. Using
|
templates control where received logs are stored. Using
|
||||||
@@ -378,7 +377,7 @@ local0.* ?NovaAll
|
|||||||
</para>
|
</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
<para>This is useful as logs from c02.example.com go to:</para>
|
<para>This is useful, as logs from c02.example.com go to:</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>
|
<para>
|
||||||
@@ -397,10 +396,12 @@ local0.* ?NovaAll
|
|||||||
</section>
|
</section>
|
||||||
</section>
|
</section>
|
||||||
<section xml:id="stacktach">
|
<section xml:id="stacktach">
|
||||||
|
<!-- FIXME This section needs updating, especially with the advent of
|
||||||
|
ceilometer -->
|
||||||
<title>StackTach</title>
|
<title>StackTach</title>
|
||||||
<para>StackTach is a tool created by Rackspace to collect and
|
<para>StackTach is a tool created by Rackspace to collect and
|
||||||
report the notifications sent by <code>nova</code>.
|
report the notifications sent by <code>nova</code>.
|
||||||
Notifications are essentially the same as logs, but can be
|
Notifications are essentially the same as logs but can be
|
||||||
much more detailed. A good overview of notifications can
|
much more detailed. A good overview of notifications can
|
||||||
be found at <link xlink:title="StackTach GitHub repo"
|
be found at <link xlink:title="StackTach GitHub repo"
|
||||||
xlink:href="https://wiki.openstack.org/wiki/SystemUsageData"
|
xlink:href="https://wiki.openstack.org/wiki/SystemUsageData"
|
||||||
@@ -433,7 +434,7 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
|
|||||||
capable of executing arbitrary commands to check the
|
capable of executing arbitrary commands to check the
|
||||||
status of server and network services, remotely
|
status of server and network services, remotely
|
||||||
executing arbitrary commands directly on servers, and
|
executing arbitrary commands directly on servers, and
|
||||||
allow servers to push notifications back in the form
|
allowing servers to push notifications back in the form
|
||||||
of passive monitoring. Nagios has been around since
|
of passive monitoring. Nagios has been around since
|
||||||
1999. Although newer monitoring services are
|
1999. Although newer monitoring services are
|
||||||
available, Nagios is a tried-and-true systems
|
available, Nagios is a tried-and-true systems
|
||||||
@@ -442,9 +443,9 @@ notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisti
|
|||||||
<section xml:id="process_monitoring">
|
<section xml:id="process_monitoring">
|
||||||
<title>Process Monitoring</title>
|
<title>Process Monitoring</title>
|
||||||
<para>A basic type of alert monitoring is to simply check
|
<para>A basic type of alert monitoring is to simply check
|
||||||
and see if a required process is running. For example,
|
and see whether a required process is running. For example,
|
||||||
ensure that the <code>nova-api</code> service is
|
ensure that the <code>nova-api</code> service is
|
||||||
running on the Cloud Controller:</para>
|
running on the cloud controller:</para>
|
||||||
<screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
|
<screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
|
||||||
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
|
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
|
||||||
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
|
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
|
||||||
@@ -477,22 +478,22 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
more resources are critically low. While the
|
more resources are critically low. While the
|
||||||
monitoring thresholds should be tuned to your specific
|
monitoring thresholds should be tuned to your specific
|
||||||
OpenStack environment, monitoring resource usage is
|
OpenStack environment, monitoring resource usage is
|
||||||
not specific to OpenStack at all – any generic type of
|
not specific to OpenStack at all–any generic type of
|
||||||
alert will work fine.</para>
|
alert will work fine.</para>
|
||||||
<para>Some of the resources that you want to monitor
|
<para>Some of the resources that you want to monitor
|
||||||
include:</para>
|
include:</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Disk Usage</para>
|
<para>Disk usage</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Server Load</para>
|
<para>Server load</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Memory Usage</para>
|
<para>Memory usage</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Network IO</para>
|
<para>Network I/O</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Available vCPUs</para>
|
<para>Available vCPUs</para>
|
||||||
@@ -512,8 +513,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
configuration:</para>
|
configuration:</para>
|
||||||
<programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting>
|
<programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting>
|
||||||
<para>Nagios alerts you with a WARNING when any disk on
|
<para>Nagios alerts you with a WARNING when any disk on
|
||||||
the compute node is 80% full and CRITICAL when 90% is
|
the compute node is 80 percent full and CRITICAL when 90
|
||||||
full.</para>
|
percent is full.</para>
|
||||||
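The two-threshold logic described here can be sketched in Python. This mirrors the prose (80 percent and 90 percent full) and is not a reimplementation of the `check_disk` plugin or its argument semantics; the function names are hypothetical.

```python
import shutil


def disk_status(used_bytes, total_bytes, warn_pct=80.0, crit_pct=90.0):
    """Classify disk usage as OK, WARNING, or CRITICAL by percent full."""
    percent = 100.0 * used_bytes / total_bytes
    if percent >= crit_pct:
        return "CRITICAL"
    if percent >= warn_pct:
        return "WARNING"
    return "OK"


def check_path(path="/"):
    """Apply the thresholds to the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return disk_status(usage.used, usage.total)


print(check_path("/"))
```

Checking the critical threshold first ensures that a disk at 95 percent reports CRITICAL rather than WARNING.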
</section>
|
</section>
|
||||||
<section xml:id="metering_telemetry">
|
<section xml:id="metering_telemetry">
|
||||||
<title>Metering and Telemetry with Ceilometer</title>
|
<title>Metering and Telemetry with Ceilometer</title>
|
||||||
@@ -530,13 +531,13 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
xlink:href="http://docs.openstack.org/developer/ceilometer/"
|
xlink:href="http://docs.openstack.org/developer/ceilometer/"
|
||||||
>http://docs.openstack.org/developer/ceilometer/</link>.</para></section>
|
>http://docs.openstack.org/developer/ceilometer/</link>.</para></section>
|
||||||
<section xml:id="os_resources">
|
<section xml:id="os_resources">
|
||||||
<title>OpenStack-specific Resources</title>
|
<title>OpenStack-Specific Resources</title>
|
||||||
<para>Resources such as memory, disk, and CPU are generic
|
<para>Resources such as memory, disk, and CPU are generic
|
||||||
resources that all servers (even non-OpenStack
|
resources that all servers (even non-OpenStack
|
||||||
servers) have and are important to the overall health
|
servers) have and are important to the overall health
|
||||||
of the server. When dealing with OpenStack
|
of the server. When dealing with OpenStack
|
||||||
specifically, these resources are important for a
|
specifically, these resources are important for a
|
||||||
second reason: ensuring enough are available in order
|
second reason: ensuring that enough are available
|
||||||
to launch instances. There are a few ways you can see
|
to launch instances. There are a few ways you can see
|
||||||
OpenStack resource usage.</para>
|
OpenStack resource usage.</para>
|
||||||
<para>The first is through the <code>nova</code>
|
<para>The first is through the <code>nova</code>
|
||||||
@@ -545,14 +546,14 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
<para>This command displays a list of how many instances a
|
<para>This command displays a list of how many instances a
|
||||||
tenant has running and some light usage statistics
|
tenant has running and some light usage statistics
|
||||||
about the combined instances. This command is useful
|
about the combined instances. This command is useful
|
||||||
for a quick overview of your cloud, but doesn't really
|
for a quick overview of your cloud, but it doesn't really
|
||||||
get into a lot of details.</para>
|
get into a lot of details.</para>
|
||||||
<para>Next, the <code>nova</code> database contains three
|
<para>Next, the <code>nova</code> database contains three
|
||||||
tables that store usage information.</para>
|
tables that store usage information.</para>
|
||||||
<para>The <code>nova.quotas</code> and
|
<para>The <code>nova.quotas</code> and
|
||||||
<code>nova.quota_usages</code> tables store quota
|
<code>nova.quota_usages</code> tables store quota
|
||||||
information. If a tenant's quota is different than the
|
information. If a tenant's quota is different from the
|
||||||
default quota settings, their quota is stored in
|
default quota settings, its quota is stored in the
|
||||||
<code>nova.quotas</code> table. For
|
<code>nova.quotas</code> table. For
|
||||||
example:</para>
|
example:</para>
|
||||||
<screen><prompt>mysql></prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
|
<screen><prompt>mysql></prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
|
||||||
@@ -587,12 +588,12 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
<para>By comparing a tenant's hard limit with their
|
<para>By comparing a tenant's hard limit with their
|
||||||
current resource usage, you can see their usage
|
current resource usage, you can see their usage
|
||||||
percentage. For example, if this tenant is using 1
|
percentage. For example, if this tenant is using 1
|
||||||
Floating IP out of 10, then they are using 10% of
|
floating IP out of 10, then they are using 10 percent of
|
||||||
their Floating IP quota. Rather than doing the
|
their floating IP quota. Rather than doing the
|
||||||
calculation manually, you can use SQL or the scripting
|
calculation manually, you can use SQL or the scripting
|
||||||
language of your choice and create a formatted
|
language of your choice and create a formatted
|
||||||
report:</para>
|
report:</para>
|
||||||
<screen><computeroutput>+----------------------------------+------------+-------------+---------------+
|
<screen><computeroutput>+----------------------------------+------------+-------------+---------------+
|
||||||
| some_tenant |
|
| some_tenant |
|
||||||
+-----------------------------------+------------+------------+---------------+
|
+-----------------------------------+------------+------------+---------------+
|
||||||
| Resource | Used | Limit | |
|
| Resource | Used | Limit | |
|
||||||
@@ -613,8 +614,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
| security_groups | 0 | 10 | 0 % |
|
| security_groups | 0 | 10 | 0 % |
|
||||||
| volumes | 2 | 10 | 20 % |
|
| volumes | 2 | 10 | 20 % |
|
||||||
+-----------------------------------+------------+------------+---------------+</computeroutput></screen>
|
+-----------------------------------+------------+------------+---------------+</computeroutput></screen>
|
||||||
<para>The above was generated using a custom script which
|
<para>The above information was generated by using a custom script
|
||||||
can be found on GitHub
|
that can be found on GitHub
|
||||||
(https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para>
|
(https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para>
|
||||||
<note>
|
<note>
|
||||||
<para>This script is specific to a certain OpenStack
|
<para>This script is specific to a certain OpenStack
|
||||||
@@ -627,15 +628,15 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput><
|
|||||||
<title>Intelligent Alerting</title>
|
<title>Intelligent Alerting</title>
|
||||||
<para>Intelligent alerting can be thought of as a form of
|
<para>Intelligent alerting can be thought of as a form of
|
||||||
continuous integration for operations. For example,
|
continuous integration for operations. For example,
|
||||||
you can easily check to see if the Image Service is up and
|
you can easily check to see whether the Image Service is up and
|
||||||
running by ensuring that the <code>glance-api</code>
|
running by ensuring that the <code>glance-api</code>
|
||||||
and <code>glance-registry</code> processes are running
|
and <code>glance-registry</code> processes are running
|
||||||
or by seeing if <code>glance-api</code> is responding
|
or by seeing whether <code>glance-api</code> is responding
|
||||||
on port 9292.</para>
|
on port 9292.</para>
|
||||||
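A port check like the one mentioned above could be sketched as a plain TCP connection attempt. The helper name is hypothetical, and a successful connection only shows that something is listening on the port, not that the Image Service is answering requests correctly.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 9292 is the port named in the chapter; the host is an assumption.
print(port_is_open("localhost", 9292))
```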
<para>But how can you tell if images are being
|
<para>But how can you tell whether images are being
|
||||||
successfully uploaded to the Image Service? Maybe the
|
successfully uploaded to the Image Service? Maybe the
|
||||||
disk that Image Service is storing the images on is
|
disk that Image Service is storing the images on is
|
||||||
full or the S3 back-end is down. You could naturally
|
full or the S3 backend is down. You could naturally
|
||||||
check this by doing a quick image upload:</para>
|
check this by doing a quick image upload:</para>
|
||||||
<programlisting language="bash">#!/bin/bash
|
<programlisting language="bash">#!/bin/bash
|
||||||
#
|
#
|
||||||
@@ -649,35 +650,35 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
|
|||||||
6_64-disk.img</programlisting>
|
6_64-disk.img</programlisting>
|
||||||
<para>By taking this script and rolling it into an alert
|
<para>By taking this script and rolling it into an alert
|
||||||
for your monitoring system (such as Nagios), you now
|
for your monitoring system (such as Nagios), you now
|
||||||
have an automated way of ensuring image uploads to the
|
have an automated way of ensuring that image uploads to the
|
||||||
Image Catalog are working.</para>
|
Image Catalog are working.</para>
|
||||||
<note>
|
<note>
|
||||||
<para>You must remove the image after each test. Even
|
<para>You must remove the image after each test. Even
|
||||||
better, test whether you can successfully delete
|
better, test whether you can successfully delete
|
||||||
an image from the Image Service.</para>
|
an image from the Image Service.</para>
|
||||||
</note>
|
</note>
|
||||||
<para>Intelligent alerting takes a considerable more
|
<para>Intelligent alerting takes considerably more
|
||||||
amount of time to plan and implement than the other
|
time to plan and implement than the other
|
||||||
alerts described in this chapter. A good outline to
|
alerts described in this chapter. A good outline to
|
||||||
implement intelligent alerting is:</para>
|
implement intelligent alerting is:</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Review common actions in your cloud</para>
|
<para>Review common actions in your cloud.</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Create ways to automatically test these
|
<para>Create ways to automatically test these
|
||||||
actions</para>
|
actions.</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Roll these tests into an alerting
|
<para>Roll these tests into an alerting
|
||||||
system</para>
|
system.</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
</itemizedlist>
|
</itemizedlist>
|
||||||
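The third step, rolling a test into an alerting system, might look like the following for a Nagios-style monitor. This is a sketch under stated assumptions: the wrapper name `nagios_check` is hypothetical, and `true` and `false` stand in for real test scripts such as the image upload shown earlier. The exit codes follow the Nagios plugin convention, where 0 means OK and 2 means CRITICAL.

```shell
#!/bin/bash
# Hypothetical wrapper: run any test command and report in Nagios plugin style.
# Nagios plugin convention: exit 0 = OK, exit 2 = CRITICAL.
nagios_check() {
    local label="$1"
    shift
    # Run the remaining arguments as the test command, discarding its output.
    if "$@" >/dev/null 2>&1; then
        echo "OK - ${label}"
        return 0
    else
        echo "CRITICAL - ${label} failed"
        return 2
    fi
}

# Usage sketch: `true` and `false` are placeholders for real test scripts.
nagios_check "image upload" true
nagios_check "image upload" false || echo "exit status was $?"
```

Keeping the test command separate from the reporting wrapper means the same wrapper can front each automated test you write in the second step.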
<para>Some other examples for Intelligent Alerting
|
<para>Some other examples for Intelligent Alerting
|
||||||
include:</para>
|
include:</para>
|
||||||
<itemizedlist>
|
<itemizedlist>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Can instances launch and destroyed?</para>
|
<para>Can instances launch and be destroyed?</para>
|
||||||
</listitem>
|
</listitem>
|
||||||
<listitem>
|
<listitem>
|
||||||
<para>Can users be created?</para>
|
<para>Can users be created?</para>
|
||||||
@@ -693,7 +694,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
|
|||||||
<section xml:id="trending">
|
<section xml:id="trending">
|
||||||
<title>Trending</title>
|
<title>Trending</title>
|
||||||
<para>Trending can give you great insight into how your
|
<para>Trending can give you great insight into how your
|
||||||
cloud is performing day to day. For example, if a busy
|
cloud is performing day to day. You can learn, for example, if a busy
|
||||||
day was simply a rare occurrence or if you should
|
day was simply a rare occurrence or if you should
|
||||||
start adding new compute nodes.</para>
|
start adding new compute nodes.</para>
|
||||||
<para>Trending takes a slightly different approach than
|
<para>Trending takes a slightly different approach than
|
||||||
@@ -733,7 +734,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
|
|||||||
<para>As an example, recording <code>nova-api</code> usage
|
<para>As an example, recording <code>nova-api</code> usage
|
||||||
can allow you to track the need to scale your cloud
|
can allow you to track the need to scale your cloud
|
||||||
controller. By keeping an eye on <code>nova-api</code>
|
controller. By keeping an eye on <code>nova-api</code>
|
||||||
requests, you can determine if you need to spawn more
|
requests, you can determine whether you need to spawn more
|
||||||
nova-api processes or go as far as introducing an
|
nova-api processes or go as far as introducing an
|
||||||
entirely new server to run <code>nova-api</code>. To
|
entirely new server to run <code>nova-api</code>. To
|
||||||
get an approximate count of the requests, look for
|
get an approximate count of the requests, look for
|
||||||
@@ -762,10 +763,10 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
|
|||||||
<title>Summary</title>
|
<title>Summary</title>
|
||||||
<para>For stable operations, you want to detect failure promptly and
|
<para>For stable operations, you want to detect failure promptly and
|
||||||
determine causes efficiently. With a distributed system, it's even
|
determine causes efficiently. With a distributed system, it's even
|
||||||
more important to track the right items to meet a service level target.
|
more important to track the right items to meet a service-level target.
|
||||||
Learning where these logs are located in the file system or API gives
|
Learning where these logs are located in the file system or API gives
|
||||||
you an advantage. Plus, we have discussed how to read, interpret, and
|
you an advantage. This chapter also showed how to read, interpret, and
|
||||||
manipulate information from OpenStack services so you can monitor
|
manipulate information from OpenStack services so that you can monitor
|
||||||
effectively.</para>
|
effectively.</para>
|
||||||
</section>
|
</section>
|
||||||
</chapter>
|
</chapter>
|
||||||
|