From 3864dadb87427742bd0aa056a155e8daebcebccf Mon Sep 17 00:00:00 2001
From: Tom Fifield
Date: Sun, 16 Mar 2014 21:01:41 +1100
Subject: [PATCH] Copyedit of logging and monitoring chapter

Using the latest O'Reilly PDF, update spelling, grammar and markup for the
ops guide logging and monitoring chapter

Change-Id: Ib2452a47c4e9dc1201256113fa8a9dd1ba350f04
---
 doc/openstack-ops/ch_ops_log_monitor.xml | 153 ++++++++++++-----------
 1 file changed, 77 insertions(+), 76 deletions(-)

diff --git a/doc/openstack-ops/ch_ops_log_monitor.xml b/doc/openstack-ops/ch_ops_log_monitor.xml
index 976f8230..ea310366 100644
--- a/doc/openstack-ops/ch_ops_log_monitor.xml
+++ b/doc/openstack-ops/ch_ops_log_monitor.xml
@@ -13,15 +13,16 @@
     Logging and Monitoring
     As an OpenStack cloud is composed of so many different
-        services, there are a large number of log files. This section
-        aims to assist you in locating and working with them, and
-        other ways to track the status of your deployment.
+        services, there are a large number of log files. This chapter
+        aims to assist you in locating and working with them and
+        describes other ways to track the status of your deployment.
Where Are the Logs? Most services use the convention of writing their log files to subdirectories of the /var/log - directory. - + directory, as listed in OpenStack Log Locations. + + @@ -31,7 +32,7 @@ - + @@ -40,7 +41,7 @@ - + @@ -49,7 +50,7 @@ - + @@ -58,7 +59,7 @@ - + @@ -67,7 +68,7 @@ - + @@ -76,7 +77,7 @@ - + - - + - + - + - +
OpenStack Log Locations
Node Type
Cloud ControllerCloud controller nova-*
Cloud ControllerCloud controller glance-*
Cloud ControllerCloud controller cinder-*
Cloud ControllerCloud controller keystone-*
Cloud ControllerCloud controller neutron-*
Cloud ControllerCloud controller horizon /var/log/apache2/ @@ -84,21 +85,21 @@
All nodesmisc (Swift, + misc (swift, dnsmasq) /var/log/syslog
Compute NodesCompute nodes libvirt /var/log/libvirt/libvirtd.log
Compute NodesCompute nodes Console (boot up messages) for VM instances: /var/lib/nova/instances/instance-<instance @@ -106,36 +107,36 @@
Block Storage NodesBlock Storage nodes cinder-volume /var/log/cinder/cinder-volume.log
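            The table above shows where each service writes by default. A quick way to
            confirm this on a running system is to list the directory and tail a recent
            file. This is a minimal sketch run on the cloud controller; exact file names
            vary by distribution and release:
            # ls /var/log/nova/
            # tail -n 20 /var/log/nova/nova-api.log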
Reading the Logs OpenStack services use the standard logging levels, at increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR, CRITICAL, and TRACE. That is, messages only appear in the logs - if they are more "severe" than the particular log level + if they are more "severe" than the particular log level, with DEBUG allowing all log statements through. For example, TRACE is logged only if the software has a stack trace, while INFO is logged for every message including those that are only for information. To disable DEBUG-level logging, edit - /etc/nova/nova.conf: + /etc/nova/nova.conf as follows: debug=false Keystone is handled a little differently. To modify the logging level, edit the /etc/keystone/logging.conf file and look at the logger_root and handler_file sections. - Logging for Horizon is configured in + Logging for horizon is configured in /etc/openstack_dashboard/local_settings.py. - As Horizon is a Django web application, it follows the + Because horizon is a Django web application, it follows the Django Logging @@ -144,7 +145,7 @@ The first step in finding the source of an error is typically to search for a CRITICAL, TRACE, or ERROR message in the log starting at the bottom of the log file. - An example of a CRITICAL log message, with the + Here is an example of a CRITICAL log message, with the corresponding TRACE (Python traceback) immediately following: 2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group @@ -179,10 +180,10 @@ 2013-02-25 21:05:51 17409 TRACE cinder In this example, cinder-volumes failed to start and has provided a stack trace, since its volume back-end has been - unable to setup the storage volume - probably because the + unable to set up the storage volume—probably because the LVM volume that is expected from the configuration does not exist. - An example error log: + Here is an example error log: 2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 23 seconds. In this error, a nova service has failed to connect to @@ -209,10 +210,10 @@ faf7ded8-4a46-413b-b113-f19590746ffe. If you search for this string on the cloud controller in the /var/log/nova-*.log files, it appears in - nova-api.log, and + nova-api.log and nova-scheduler.log. If you search for this on the compute nodes in - /var/log/nova-*.log, it appears + /var/log/nova-*.log, it appears in nova-network.log and nova-compute.log. If no ERROR or CRITICAL messages appear, the most recent log entry that reports @@ -233,11 +234,11 @@ LOG = logging.getLogger(__name__) To add a DEBUG logging statement, you would do: LOG.debug("This is a custom debugging statement") - You may notice that all of the existing logging messages + You may notice that all the existing logging messages are preceded by an underscore and surrounded by parentheses, for example: LOG.debug(_("Logging statement appears here")) - This is used to support translation of logging messages + This formatting is used to support translation of logging messages into different languages using the gettext @@ -256,9 +257,7 @@ LOG = logging.getLogger(__name__) issues. Instead, we recommend you use the RabbitMQ web management interface. 
Enable it on your cloud controller: - # - /usr/lib/rabbitmq/bin/rabbitmq-plugins enable - rabbitmq_management + # /usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management # service rabbitmq-server restart The RabbitMQ web management interface is accessible on your cloud controller at http://localhost:55672. @@ -271,11 +270,11 @@ LOG = logging.getLogger(__name__) $ dpkg -s rabbitmq-server | grep "Version:" Version: 2.7.1-0ubuntu4 - An alternative to enabling the RabbitMQ Web Management - Interface is to use the rabbitmqctl commands. For example, + An alternative to enabling the RabbitMQ web management + interface is to use the rabbitmqctl commands. For example, rabbitmqctl list_queues| grep cinder displays any messages - left in the queue. If there are, it's a possible sign that + left in the queue. If any messages are there, it's a possible sign that cinder services didn't connect properly to rabbitmq and might have to be restarted. Items to monitor for RabbitMQ include the number of @@ -287,14 +286,14 @@ Version: 2.7.1-0ubuntu4 Because your cloud is most likely composed of many servers, you must check logs on each of those servers to properly piece an event together. A better solution is to - send the logs of all servers to a central location so they + send the logs of all servers to a central location so that they can all be accessed from the same area. Ubuntu uses rsyslog as the default logging service. Since it is natively able to send logs to a remote location, you don't have to install anything extra to enable this feature, just modify the configuration file. In doing this, consider running your logging over a - management network, or using an encrypted VPN to avoid + management network or using an encrypted VPN to avoid interception.
rsyslog Client Configuration @@ -327,8 +326,8 @@ syslog_log_facility=LOG_LOCAL3 following line: *.* @192.168.1.10 This instructs rsyslog to send all logs to the IP - listed. In this example, the IP points to the Cloud - Controller. + listed. In this example, the IP points to the cloud + controller.
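            Once the client configuration is saved, restarting rsyslog and emitting a
            test message is a quick way to confirm that forwarding works. This is a
            minimal sketch; the nova-test tag passed to logger is arbitrary and is used
            only to make the message easy to find on the server:
            # service rsyslog restart
            # logger -t nova-test "rsyslog client forwarding test"
            The message should then appear on the central logging server configured in
            the next section.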
rsyslog Server Configuration @@ -360,7 +359,7 @@ $template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log" local0.* ?NovaFile local0.* ?NovaAll & ~ - The above example configuration handles the nova service only. + This example configuration handles the nova service only. It first configures rsyslog to act as a server that runs on port 514. Next, it creates a series of logging templates. Logging templates control where received logs are stored. Using @@ -378,7 +377,7 @@ local0.* ?NovaAll - This is useful as logs from c02.example.com go to: + This is useful, as logs from c02.example.com go to: @@ -397,10 +396,12 @@ local0.* ?NovaAll
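            To confirm that logs are arriving, you can watch one of the per-host files on
            the central server. This is a hedged example: it assumes the DynFile template
            shown above is referenced by a catch-all rule (not shown here) and that
            c02.example.com is one of the clients configured earlier:
            # tail -f /var/log/rsyslog/c02.example.com/syslog.log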
+ StackTach StackTach is a tool created by Rackspace to collect and report the notifications sent by nova. - Notifications are essentially the same as logs, but can be + Notifications are essentially the same as logs but can be much more detailed. A good overview of notifications can be found at Process Monitoring A basic type of alert monitoring is to simply check - and see if a required process is running. For example, + and see whether a required process is running. For example, ensure that the nova-api service is - running on the Cloud Controller: + running on the cloud controller: # ps aux | grep nova-api nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf @@ -477,22 +478,22 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api< more resources are critically low. While the monitoring thresholds should be tuned to your specific OpenStack environment, monitoring resource usage is - not specific to OpenStack at all – any generic type of + not specific to OpenStack at all–any generic type of alert will work fine. Some of the resources that you want to monitor include: - Disk Usage + Disk usage - Server Load + Server load - Memory Usage + Memory usage - Network IO + Network I/O Available vCPUs @@ -512,8 +513,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api< configuration: command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e Nagios alerts you with a WARNING when any disk on - the compute node is 80% full and CRITICAL when 90% is - full. + the compute node is 80 percent full and CRITICAL when 90 + percent is full.
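            The same Nagios approach used for disk space also works for process checks.
            The following NRPE command is an illustrative sketch rather than part of the
            original configuration; it uses the standard check_procs plugin to raise a
            CRITICAL alert when no nova-api process is running:
            command[check_nova_api]=/usr/lib/nagios/plugins/check_procs -c 1: -a nova-api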
Metering and Telemetry with Ceilometer @@ -530,13 +531,13 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api< xlink:href="http://docs.openstack.org/developer/ceilometer/" >http://docs.openstack.org/developer/ceilometer/.
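            If you have deployed ceilometer, a quick way to confirm that meters are
            actually being collected is to list them with its command-line client. This
            is a hedged sketch; it assumes the python-ceilometerclient package is
            installed and that your OpenStack credentials are loaded into the
            environment:
            $ ceilometer meter-list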
-                    OpenStack-specific Resources
+                    OpenStack-Specific Resources
                    Resources such as memory, disk, and CPU are
                        generic resources that all servers (even
                        non-OpenStack servers) have and are important to
                        the overall health of the server. When dealing
                        with OpenStack specifically, these resources are
                        important for a
-                        second reason: ensuring enough are available in order
+                        second reason: ensuring that enough are available
                        to launch instances. There are a few ways you can
                        see OpenStack resource usage. The first is through
                        the nova
@@ -545,14 +546,14 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api<
                        This command displays a list of how many instances a tenant has
                        running and some light usage statistics about the combined
                        instances. This command is useful
-                        for a quick overview of your cloud, but doesn't really
+                        for a quick overview of your cloud, but it doesn't really
                        get into a lot of details.
                        Next, the nova database contains three
                        tables that store usage information.
                        The nova.quotas and
                        nova.quota_usages tables store quota
-                        information. If a tenant's quota is different than the
-                        default quota settings, their quota is stored in
+                        information. If a tenant's quota is different from the
+                        default quota settings, its quota is stored in
                        the nova.quotas table. For example:
                        mysql> select project_id, resource, hard_limit from quotas;
@@ -587,12 +588,12 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api<
                        By comparing a tenant's hard limit with their
                        current resource usage, you can see their usage
                        percentage. For example, if this tenant is using 1
-                        Floating IP out of 10, then they are using 10% of
-                        their Floating IP quota. Rather than doing the
+                        floating IP out of 10, then they are using 10 percent of
+                        their floating IP quota. Rather than doing the
                        calculation manually, you can use SQL or the
                        scripting language of your choice and create a
                        formatted report:
-                        +----------------------------------+------------+-------------+---------------+
++----------------------------------+------------+-------------+---------------+
 | some_tenant |
 +-----------------------------------+------------+------------+---------------+
 | Resource | Used | Limit | |
@@ -613,8 +614,8 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api<
 | security_groups | 0 | 10 | 0 % |
 | volumes | 2 | 10 | 20 % |
 +-----------------------------------+------------+------------+---------------+
-                        The above was generated using a custom script which
-                        can be found on GitHub
+                        The above information was generated by using a custom script
+                        that can be found on GitHub
                        (https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).
                        This script is specific to a certain OpenStack
@@ -627,15 +628,15 @@ root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api<
                    Intelligent Alerting
                    Intelligent alerting can be thought of as a form of
                        continuous integration for operations. For example,
-                        you can easily check to see if the Image Service is up and
+                        you can easily check to see whether the Image Service is up and
                        running by ensuring that the glance-api and
                        glance-registry processes are running
-                        or by seeing if glace-api is responding
+                        or by seeing whether glance-api is responding
                        on port 9292.
-                        But how can you tell if images are being
+                        But how can you tell whether images are being
                        successfully uploaded to the Image Service? Maybe the disk
                        that Image Service is storing the images on is
-                        full or the S3 back-end is down. You could naturally
+                        full or the S3 backend is down. 
You could naturally check this by doing a quick image upload: #!/bin/bash # @@ -649,35 +650,35 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba 6_64-disk.img By taking this script and rolling it into an alert for your monitoring system (such as Nagios), you now - have an automated way of ensuring image uploads to the + have an automated way of ensuring that image uploads to the Image Catalog are working. You must remove the image after each test. Even better, test whether you can successfully delete an image from the Image Service. - Intelligent alerting takes a considerable more - amount of time to plan and implement than the other + Intelligent alerting takes considerably more + time to plan and implement than the other alerts described in this chapter. A good outline to implement intelligent alerting is: - Review common actions in your cloud + Review common actions in your cloud. Create ways to automatically test these - actions + actions. Roll these tests into an alerting - system + system. Some other examples for Intelligent Alerting include: - Can instances launch and destroyed? + Can instances launch and be destroyed? Can users be created? @@ -693,7 +694,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba
Trending Trending can give you great insight into how your - cloud is performing day to day. For example, if a busy + cloud is performing day to day. You can learn, for example, if a busy day was simply a rare occurrence or if you should start adding new compute nodes. Trending takes a slightly different approach than @@ -733,7 +734,7 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba As an example, recording nova-api usage can allow you to track the need to scale your cloud controller. By keeping an eye on nova-api - requests, you can determine if you need to spawn more + requests, you can determine whether you need to spawn more nova-api processes or go as far as introducing an entirely new server to run nova-api. To get an approximate count of the requests, look for @@ -762,10 +763,10 @@ glance image-create --name='cirros image' --is-public=true --container-format=ba Summary For stable operations, you want to detect failure promptly and determine causes efficiently. With a distributed system, it's even - more important to track the right items to meet a service level target. + more important to track the right items to meet a service-level target. Learning where these logs are located in the file system or API gives - you an advantage. Plus, we have discussed how to read, interpret, and - manipulate information from OpenStack services so you can monitor + you an advantage. This chapter also showed how to read, interpret, and + manipulate information from OpenStack services so that you can monitor effectively.