Replace host_alive test with hypervisor VM states

Currently, the host_alive_status metric is a combination of ping tests
and the general VM up/down state reported by the hypervisor's
isActive() call.  If ping tests are not supported in the environment or
by the VM's security rules, host_alive_status becomes practically
useless.

This change proposes to make host_alive_status a reliable, stable metric
consisting of one of eight possible status codes, as reported by the
hypervisor, including shut off/paused/suspended, etc.  A value_meta
'detail' tag includes a human-readable translation of the status code,
and the codes are also added to the plugin documentation.

In addition, this change renames the 'ping_only' configuration parameter
to 'alive_only', providing a way to gauge the general status of all VMs
without the additional load of performance metrics, for situations where
the latter would be overwhelming.  If alive_only is True, only the VM
status metrics and aggregate metrics are published.

Ping-based testing of VMs is now reported under the new 'ping_status'
metric.
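The new scheme can be sketched as follows. This is an illustrative sketch rather than the plugin code itself; the numeric constants mirror libvirt's VIR_DOMAIN_* domain-state values (NOSTATE=0 through PMSUSPENDED=7), and subtracting 1 calibrates the published code to UNIX conventions so that 0 means OK:

```python
# Illustrative sketch of the status-code scheme (not the plugin code).
# The constants mirror libvirt's VIR_DOMAIN_* domain-state values.
(VIR_DOMAIN_NOSTATE, VIR_DOMAIN_RUNNING, VIR_DOMAIN_BLOCKED,
 VIR_DOMAIN_PAUSED, VIR_DOMAIN_SHUTDOWN, VIR_DOMAIN_SHUTOFF,
 VIR_DOMAIN_CRASHED, VIR_DOMAIN_PMSUSPENDED) = range(8)

# Human-readable translations published in the value_meta 'detail' tag.
DETAILS = {
    VIR_DOMAIN_NOSTATE: 'VM has no state',
    VIR_DOMAIN_BLOCKED: 'VM is blocked',
    VIR_DOMAIN_PAUSED: 'VM is paused',
    VIR_DOMAIN_SHUTDOWN: 'VM is shutting down',
    VIR_DOMAIN_SHUTOFF: 'VM has been shut off',
    VIR_DOMAIN_CRASHED: 'VM has crashed',
    VIR_DOMAIN_PMSUSPENDED: 'VM is in power management (s3) suspend',
}


def alive_status(state):
    """Translate a libvirt domain state into (code, detail).

    Subtracting 1 calibrates to UNIX status conventions, so a
    RUNNING (1) domain yields code 0 with no detail.
    """
    return state - 1, DETAILS.get(state)
```

For example, a running VM yields `(0, None)` and a shut-off VM yields `(4, 'VM has been shut off')`.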

Change-Id: I77c87b222de0aa9882d1761e8cade87ef1719d7d
David Schroeder 2015-10-30 14:13:22 -06:00
parent 10ef624576
commit 3520fba6e0
3 changed files with 93 additions and 51 deletions


@@ -12,6 +12,7 @@ init_config:
# How long before gathering data on newly-provisioned instances? (seconds)
vm_probation: 300
# Command line to ping VMs, set to False (or simply remove) to disable
# Note that this is currently only supported by Nova networking
ping_check: /bin/ping -n -c1 -w1 -q
# List of instance metadata keys to be sent as dimensions
# By default 'scale_group' metadata is used here for supporting auto


@@ -54,6 +54,7 @@
- [Instance Cache](#instance-cache)
- [Metrics Cache](#metrics-cache)
- [Per-Instance Metrics](#per-instance-metrics)
- [host_alive_status Codes](#host_alive_status-codes)
- [VM Dimensions](#vm-dimensions)
- [Aggregate Metrics](#aggregate-metrics)
- [Crash Dump Monitoring](#crash-dump-monitoring)
@@ -1011,9 +1012,9 @@ If the owner of the VM is in a different tenant the Agent Cross-Tenant Metric Su
`vm_probation` specifies a period of time (in seconds) in which to suspend metrics from a newly-created VM. This is to prevent quickly-obsolete metrics in an environment with a high amount of instance churn (VMs created and destroyed in rapid succession). The default probation length is 300 seconds (five minutes). Setting to 0 disables VM probation, and metrics will be recorded as soon as possible after a VM is created.
`ping_check` includes the command line (sans the IP address) used to perform a ping check against instances. Set to False (or omit altogether) to disable ping checks. This is automatically populated during `monasca-setup` from a list of possible `ping` command lines. Generally, `fping` is preferred over `ping` because it can return a failure with sub-second resolution, but if `fping` does not exist on the system, `ping` will be used instead. If ping_check is disabled, the `host_alive_status` metric will not be published unless that VM is inactive. This is because the host status is inconclusive without a ping check.
`ping_check` includes the command line (sans the IP address) used to perform a ping check against instances. Set to False (or omit altogether) to disable ping checks. This is automatically populated during `monasca-setup` from a list of possible `ping` command lines. Generally, `fping` is preferred over `ping` because it can return a failure with sub-second resolution, but if `fping` does not exist on the system, `ping` will be used instead.
`ping_only` will suppress all per-VM metrics aside from `host_alive_status` and `vm.host_alive_status`, including all I/O, network, memory, and CPU metrics. [Aggregate Metrics](#aggregate-metrics), however, would still be enabled if `ping_only` is true. By default, `ping_only` is false. If both `ping_only` and `ping_check` are set to false, the only metrics published by the Libvirt plugin would be the Aggregate Metrics.
`alive_only` will suppress all per-VM metrics aside from `host_alive_status` and `vm.host_alive_status`, including all I/O, network, memory, ping, and CPU metrics. [Aggregate Metrics](#aggregate-metrics), however, would still be enabled if `alive_only` is true. By default, `alive_only` is false.
**Note:** Ping checks are not currently supported in compute environments that utilize network namespaces. Neutron, by default, enables namespaces, and is therefore not supported at this time. Ping checks are known to be functional with Nova networking when a guest network named 'private' is used. In any other environment, ping checks are automatically disabled, and no `ping_status` metric will be published; `host_alive_status` is still reported from the hypervisor state (see [host_alive_status Codes](#host_alive_status-codes)). Proper Neutron namespace support is planned for a future release.
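When enabled, the check simply appends the instance's private IP to the configured command line and publishes the command's exit status. A simplified standalone sketch (the function name is illustrative, not part of the plugin's API):

```python
import os
import subprocess


def run_ping_check(ping_check, ip):
    """Run the configured ping command line against an instance IP.

    `ping_check` is the configured command sans IP address, e.g.
    '/usr/bin/fping -n -c1 -t250 -q'.  Returns the exit status,
    where 0 means the instance answered.
    """
    cmd = ping_check.split() + [ip]
    with open(os.devnull, 'w') as fnull:
        # The exit status is published as-is in the ping_status metric.
        return subprocess.call(cmd, stdout=fnull, stderr=fnull)
```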
@@ -1029,7 +1030,7 @@ init_config:
nova_refresh: 14400
vm_probation: 300
ping_check: /usr/bin/fping -n -c1 -t250 -q
ping_only: false
alive_only: false
instances:
- {}
```
@@ -1039,7 +1040,7 @@ Note: If the Nova service login credentials are changed, `monasca-setup` would n
Example `monasca-setup` usage:
```
monasca-setup -d libvirt -a 'ping_check=false ping_only=false'
monasca-setup -d libvirt -a 'ping_check=false alive_only=false' --overwrite
```
### Instance Cache
@@ -1083,7 +1084,7 @@ instance-00000004:
| Name | Description | Associated Dimensions |
| -------------------- | -------------------------------------- | ---------------------- |
| cpu.utilization_perc | Overall CPU utilization (percentage) | |
| host_alive_status | Returns status: 0=passes ping check, 1=fails ping check, 2=inactive | |
| host_alive_status | See [host_alive_status Codes](#host_alive_status-codes) below | |
| io.read_ops_sec | Disk I/O read operations per second | 'device' (ie, 'hdd') |
| io.write_ops_sec | Disk I/O write operations per second | 'device' (ie, 'hdd') |
| io.read_bytes_sec | Disk I/O read bytes per second | 'device' (ie, 'hdd') |
@@ -1093,11 +1094,26 @@ instance-00000004:
| net.out_packets_sec | Network transmitted packets per second | 'device' (ie, 'vnet0') |
| net.in_bytes_sec | Network received bytes per second | 'device' (ie, 'vnet0') |
| net.out_bytes_sec | Network transmitted bytes per second | 'device' (ie, 'vnet0') |
| mem.free_mb | Free memory in Mbytes | |
| mem.total_mb | Total memory in Mbytes | |
| mem.used_mb | Used memory in Mbytes | |
| mem.free_perc | Percent of memory free | |
| mem.swap_used_mb | Used swap space in Mbytes | |
| ping_status | 0 for ping success, 1 for ping failure | |
#### host_alive_status Codes
| Code | Description | value_meta 'detail' |
| ---- | ------------------------------------ | --------------------------------------- |
| -1 | No state | VM has no state |
| 0 | Running / OK | None |
| 1 | Idle / blocked | VM is blocked |
| 2 | Paused | VM is paused |
| 3 | Shutting down | VM is shutting down |
| 4 | Shut off | VM has been shut off |
| 4 | Nova suspend | VM has been suspended |
| 5 | Crashed | VM has crashed |
| 6 | Power management suspend (S3 state) | VM is in power management (s3) suspend |
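A hypothetical consumer-side helper (not part of the plugin) illustrates how the code and the value_meta 'detail' field fit together:

```python
def describe_alive_status(code, value_meta=None):
    """Return a human-readable summary for a host_alive_status sample.

    Any nonzero code means the VM is not running normally; the
    optional value_meta 'detail' carries the hypervisor's reason.
    """
    if code == 0:
        return 'OK'
    detail = (value_meta or {}).get('detail')
    return detail or 'VM not running (code {0})'.format(code)
```

For example, code 4 with detail 'VM has been shut off' returns that detail, while code 0 (a running VM, which carries no detail) returns 'OK'.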
Memory statistics require a balloon driver on the VM. For the Linux kernel, this is the `CONFIG_VIRTIO_BALLOON` configuration parameter, active by default in Ubuntu, and enabled by default as a kernel module in Debian, CentOS, and SUSE.


@@ -16,6 +16,7 @@
"""Monasca Agent interface for libvirt metrics"""
import json
import libvirt
import os
import stat
import subprocess
@@ -27,6 +28,14 @@ from distutils.version import LooseVersion
from monasca_agent.collector.checks import AgentCheck
from monasca_agent.collector.virt import inspector

DOM_STATES = {libvirt.VIR_DOMAIN_BLOCKED: 'VM is blocked',
              libvirt.VIR_DOMAIN_CRASHED: 'VM has crashed',
              libvirt.VIR_DOMAIN_NONE: 'VM has no state',
              libvirt.VIR_DOMAIN_PAUSED: 'VM is paused',
              libvirt.VIR_DOMAIN_PMSUSPENDED: 'VM is in power management (s3) suspend',
              libvirt.VIR_DOMAIN_SHUTDOWN: 'VM is shutting down',
              libvirt.VIR_DOMAIN_SHUTOFF: 'VM has been shut off'}


class LibvirtCheck(AgentCheck):
@@ -281,6 +290,32 @@ class LibvirtCheck(AgentCheck):
                              'timestamp': sample_time,
                              'value': value}

    def _inspect_state(self, insp, inst, instance_cache, dims_customer, dims_operations):
        """Look at the state of the instance, publish a metric using a
        user-friendly description in the 'detail' metadata, and return
        a status code (calibrated to UNIX status codes where 0 is OK)
        so that remaining metrics can be skipped if the VM is not OK
        """
        inst_name = inst.name()
        dom_status = inst.state()[0] - 1
        metatag = None

        if inst.state()[0] in DOM_STATES:
            metatag = {'detail': DOM_STATES[inst.state()[0]]}
        # A nova-suspended VM has a SHUTOFF Power State, but alternate Status
        if inst.state() == [libvirt.VIR_DOMAIN_SHUTOFF, 5]:
            metatag = {'detail': 'VM has been suspended'}

        self.gauge('host_alive_status', dom_status, dimensions=dims_customer,
                   delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                   hostname=instance_cache.get(inst_name)['hostname'],
                   value_meta=metatag)
        self.gauge('vm.host_alive_status', dom_status,
                   dimensions=dims_operations,
                   value_meta=metatag)

        return dom_status

    def check(self, instance):
        """Gather VM metrics for each instance"""
@@ -331,16 +366,20 @@ class LibvirtCheck(AgentCheck):
                self.log.error("{0} is not known to nova after instance cache update -- skipping this ghost VM.".format(inst_name))
                continue

            # Skip instances that are inactive
            if inst.isActive() == 0:
                detail = 'Instance is not active'
                self.gauge('host_alive_status', 2, dimensions=dims_customer,
                           delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                           hostname=instance_cache.get(inst_name)['hostname'],
                           value_meta={'detail': detail})
                self.gauge('vm.host_alive_status', 2, dimensions=dims_operations,
                           value_meta={'detail': detail})

            # Accumulate aggregate data
            for gauge in agg_gauges:
                if gauge in instance_cache.get(inst_name):
                    agg_values[gauge] += instance_cache.get(inst_name)[gauge]

            # Skip further processing on VMs that are not in an active state
            if self._inspect_state(insp, inst, instance_cache,
                                   dims_customer, dims_operations) != 0:
                continue

            # Skip the remainder of the checks if alive_only is True in the config
            if self.init_config.get('alive_only'):
                continue

            if inst_name not in metric_cache:
                metric_cache[inst_name] = {}
@@ -351,39 +390,6 @@ class LibvirtCheck(AgentCheck):
                                               vm_probation_remaining))
                continue

            # Test instance's general responsiveness (ping check) if so configured
            if self.init_config.get('ping_check') and 'private_ip' in instance_cache.get(inst_name):
                detail = 'Ping check OK'
                ping_cmd = self.init_config.get('ping_check').split()
                ping_cmd.append(instance_cache.get(inst_name)['private_ip'])
                with open(os.devnull, "w") as fnull:
                    try:
                        res = subprocess.call(ping_cmd,
                                              stdout=fnull,
                                              stderr=fnull)
                        if res > 0:
                            detail = 'Host failed ping check'
                        self.gauge('host_alive_status', res, dimensions=dims_customer,
                                   delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                                   hostname=instance_cache.get(inst_name)['hostname'],
                                   value_meta={'detail': detail})
                        self.gauge('vm.host_alive_status', res, dimensions=dims_operations,
                                   value_meta={'detail': detail})
                        # Do not attempt to process any more metrics for offline hosts
                        if res > 0:
                            continue
                    except OSError as e:
                        self.log.warn("OS error running '{0}' returned {1}".format(ping_cmd, e))

            # Skip the remainder of the checks if ping_only is True in the config
            if self.init_config.get('ping_only'):
                continue

            # Accumulate aggregate data
            for gauge in agg_gauges:
                if gauge in instance_cache.get(inst_name):
                    agg_values[gauge] += instance_cache.get(inst_name)[gauge]

            self._inspect_cpu(insp, inst, instance_cache, metric_cache, dims_customer, dims_operations)
            self._inspect_disks(insp, inst, instance_cache, metric_cache, dims_customer, dims_operations)
            self._inspect_network(insp, inst, instance_cache, metric_cache, dims_customer, dims_operations)
@@ -405,6 +411,25 @@ class LibvirtCheck(AgentCheck):
                except KeyError:
                    self.log.debug("Balloon driver not active/available on guest {0} ({1})".format(inst_name,
                                                                                                   instance_cache.get(inst_name)['hostname']))

            # Test instance's remote responsiveness (ping check) if so configured
            # NOTE: This is only supported for Nova networking at this time.
            if self.init_config.get('ping_check') and 'private_ip' in instance_cache.get(inst_name):
                ping_cmd = self.init_config.get('ping_check').split()
                ping_cmd.append(instance_cache.get(inst_name)['private_ip'])
                with open(os.devnull, "w") as fnull:
                    try:
                        res = subprocess.call(ping_cmd,
                                              stdout=fnull,
                                              stderr=fnull)
                        self.gauge('ping_status', res, dimensions=dims_customer,
                                   delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                                   hostname=instance_cache.get(inst_name)['hostname'])
                        self.gauge('vm.ping_status', res, dimensions=dims_operations)
                        # Do not attempt to process any more metrics for offline hosts
                        if res > 0:
                            continue
                    except OSError as e:
                        self.log.warn("OS error running '{0}' returned {1}".format(ping_cmd, e))

        # Save these metrics for the next collector invocation
        self._update_metric_cache(metric_cache)