Replace host_alive test with hypervisor VM states

Currently, the host_alive_status metric is a combination of ping tests
and the general VM up/down state reported by the hypervisor's
isActive() call.  If ping tests are not supported in the environment or
by the VM's security rules, host_alive_status becomes practically
useless.

This change proposes to make host_alive_status a reliable, stable metric
consisting of one of eight possible status codes, as reported by the
hypervisor, including shut off/paused/suspended, etc.  A value_meta
'detail' tag includes a human-readable translation of the status code,
and the codes are also added to the plugin documentation.

In addition, this change renames the 'ping_only' configuration parameter
to 'alive_only', providing a way to gauge the general status of all VMs
without the additional load of performance metrics, for situations where
the latter would be overwhelming.  If alive_only is True, only the VM
status metrics and aggregate metrics are published.

Ping-based testing of VMs is now reported under the new 'ping_status'
metric.
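The new scheme can be sketched as follows. This is an illustrative sketch rather than the plugin code itself; the numeric constants mirror libvirt's VIR_DOMAIN_* domain-state values (NOSTATE=0 through PMSUSPENDED=7), and subtracting 1 calibrates the published code to UNIX conventions so that 0 means OK:

```python
# Illustrative sketch of the status-code scheme (not the plugin code).
# The constants mirror libvirt's VIR_DOMAIN_* domain-state values.
(VIR_DOMAIN_NOSTATE, VIR_DOMAIN_RUNNING, VIR_DOMAIN_BLOCKED,
 VIR_DOMAIN_PAUSED, VIR_DOMAIN_SHUTDOWN, VIR_DOMAIN_SHUTOFF,
 VIR_DOMAIN_CRASHED, VIR_DOMAIN_PMSUSPENDED) = range(8)

# Human-readable translations published in the value_meta 'detail' tag.
DETAILS = {
    VIR_DOMAIN_NOSTATE: 'VM has no state',
    VIR_DOMAIN_BLOCKED: 'VM is blocked',
    VIR_DOMAIN_PAUSED: 'VM is paused',
    VIR_DOMAIN_SHUTDOWN: 'VM is shutting down',
    VIR_DOMAIN_SHUTOFF: 'VM has been shut off',
    VIR_DOMAIN_CRASHED: 'VM has crashed',
    VIR_DOMAIN_PMSUSPENDED: 'VM is in power management (s3) suspend',
}


def alive_status(state):
    """Translate a libvirt domain state into (code, detail).

    Subtracting 1 calibrates to UNIX status conventions, so a
    RUNNING (1) domain yields code 0 with no detail.
    """
    return state - 1, DETAILS.get(state)
```

For example, a running VM yields `(0, None)` and a shut-off VM yields `(4, 'VM has been shut off')`.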

Change-Id: I77c87b222de0aa9882d1761e8cade87ef1719d7d
David Schroeder 2015-10-30 14:13:22 -06:00
parent 10ef624576
commit 3520fba6e0
3 changed files with 93 additions and 51 deletions


@@ -12,6 +12,7 @@ init_config:
# How long before gathering data on newly-provisioned instances? (seconds)
vm_probation: 300
# Command line to ping VMs, set to False (or simply remove) to disable
# Note that this is currently only supported by Nova networking
ping_check: /bin/ping -n -c1 -w1 -q
# List of instance metadata keys to be sent as dimensions
# By default 'scale_group' metadata is used here for supporting auto


@@ -54,6 +54,7 @@
- [Instance Cache](#instance-cache)
- [Metrics Cache](#metrics-cache)
- [Per-Instance Metrics](#per-instance-metrics)
- [host_alive_status Codes](#host_alive_status-codes)
- [VM Dimensions](#vm-dimensions)
- [Aggregate Metrics](#aggregate-metrics)
- [Crash Dump Monitoring](#crash-dump-monitoring)
@@ -1011,9 +1012,9 @@ If the owner of the VM is in a different tenant the Agent Cross-Tenant Metric Su
`vm_probation` specifies a period of time (in seconds) in which to suspend metrics from a newly-created VM. This is to prevent quickly-obsolete metrics in an environment with a high amount of instance churn (VMs created and destroyed in rapid succession). The default probation length is 300 seconds (five minutes). Setting to 0 disables VM probation, and metrics will be recorded as soon as possible after a VM is created.
`ping_check` includes the command line (sans the IP address) used to perform a ping check against instances. Set to False (or omit altogether) to disable ping checks. This is automatically populated during `monasca-setup` from a list of possible `ping` command lines. Generally, `fping` is preferred over `ping` because it can return a failure with sub-second resolution, but if `fping` does not exist on the system, `ping` will be used instead. If ping_check is disabled, the `host_alive_status` metric will not be published unless that VM is inactive. This is because the host status is inconclusive without a ping check.
`ping_check` includes the command line (sans the IP address) used to perform a ping check against instances. Set to False (or omit altogether) to disable ping checks. This is automatically populated during `monasca-setup` from a list of possible `ping` command lines. Generally, `fping` is preferred over `ping` because it can return a failure with sub-second resolution, but if `fping` does not exist on the system, `ping` will be used instead.
`ping_only` will suppress all per-VM metrics aside from `host_alive_status` and `vm.host_alive_status`, including all I/O, network, memory, and CPU metrics. [Aggregate Metrics](#aggregate-metrics), however, would still be enabled if `ping_only` is true. By default, `ping_only` is false. If both `ping_only` and `ping_check` are set to false, the only metrics published by the Libvirt plugin would be the Aggregate Metrics.
`alive_only` will suppress all per-VM metrics aside from `host_alive_status` and `vm.host_alive_status`, including all I/O, network, memory, ping, and CPU metrics. [Aggregate Metrics](#aggregate-metrics), however, would still be enabled if `alive_only` is true. By default, `alive_only` is false.
**Note:** Ping checks are not currently supported in compute environments that utilize network namespaces. Neutron, by default, enables namespaces, and is therefore not supported at this time. Ping checks are known to be functional with Nova networking when a guest network named 'private' is used. In any other environment, ping checks are automatically disabled, and no `ping_status` metric will be published; `host_alive_status` is still reported from the hypervisor state (see [host_alive_status Codes](#host_alive_status-codes)). Proper Neutron namespace support is planned for a future release.
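When enabled, the check simply appends the instance's private IP to the configured command line and publishes the command's exit status. A simplified standalone sketch (the function name is illustrative, not part of the plugin's API):

```python
import os
import subprocess


def run_ping_check(ping_check, ip):
    """Run the configured ping command line against an instance IP.

    `ping_check` is the configured command sans IP address, e.g.
    '/usr/bin/fping -n -c1 -t250 -q'.  Returns the exit status,
    where 0 means the instance answered.
    """
    cmd = ping_check.split() + [ip]
    with open(os.devnull, 'w') as fnull:
        # The exit status is published as-is in the ping_status metric.
        return subprocess.call(cmd, stdout=fnull, stderr=fnull)
```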
@@ -1029,7 +1030,7 @@ init_config:
nova_refresh: 14400
vm_probation: 300
ping_check: /usr/bin/fping -n -c1 -t250 -q
ping_only: false
alive_only: false
instances:
- {}
```
@@ -1039,7 +1040,7 @@ Note: If the Nova service login credentials are changed, `monasca-setup` would n
Example `monasca-setup` usage:
```
monasca-setup -d libvirt -a 'ping_check=false ping_only=false'
monasca-setup -d libvirt -a 'ping_check=false alive_only=false' --overwrite
```
### Instance Cache
@@ -1083,7 +1084,7 @@ instance-00000004:
| Name | Description | Associated Dimensions |
| -------------------- | -------------------------------------- | ---------------------- |
| cpu.utilization_perc | Overall CPU utilization (percentage) | |
| host_alive_status | Returns status: 0=passes ping check, 1=fails ping check, 2=inactive | |
| host_alive_status | See [host_alive_status Codes](#host_alive_status-codes) below | |
| io.read_ops_sec | Disk I/O read operations per second | 'device' (ie, 'hdd') |
| io.write_ops_sec | Disk I/O write operations per second | 'device' (ie, 'hdd') |
| io.read_bytes_sec | Disk I/O read bytes per second | 'device' (ie, 'hdd') |
@@ -1093,11 +1094,26 @@ instance-00000004:
| net.out_packets_sec | Network transmitted packets per second | 'device' (ie, 'vnet0') |
| net.in_bytes_sec | Network received bytes per second | 'device' (ie, 'vnet0') |
| net.out_bytes_sec | Network transmitted bytes per second | 'device' (ie, 'vnet0') |
| mem.free_mb | Free memory in Mbytes | |
| mem.total_mb | Total memory in Mbytes | |
| mem.used_mb | Used memory in Mbytes | |
| mem.free_perc | Percent of memory free | |
| mem.swap_used_mb | Used swap space in Mbytes | |
| ping_status | 0 for ping success, 1 for ping failure | |
#### host_alive_status Codes
| Code | Description | value_meta 'detail' |
| ---- | ------------------------------------ | --------------------------------------- |
| -1 | No state | VM has no state |
| 0 | Running / OK | None |
| 1 | Idle / blocked | VM is blocked |
| 2 | Paused | VM is paused |
| 3 | Shutting down | VM is shutting down |
| 4 | Shut off | VM has been shut off |
| 4 | Nova suspend | VM has been suspended |
| 5 | Crashed | VM has crashed |
| 6 | Power management suspend (S3 state) | VM is in power management (s3) suspend |
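A hypothetical consumer-side helper (not part of the plugin) illustrates how the code and the value_meta 'detail' field fit together:

```python
def describe_alive_status(code, value_meta=None):
    """Return a human-readable summary for a host_alive_status sample.

    Any nonzero code means the VM is not running normally; the
    optional value_meta 'detail' carries the hypervisor's reason.
    """
    if code == 0:
        return 'OK'
    detail = (value_meta or {}).get('detail')
    return detail or 'VM not running (code {0})'.format(code)
```

For example, code 4 with detail 'VM has been shut off' returns that detail, while code 0 (a running VM, which carries no detail) returns 'OK'.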
Memory statistics require a balloon driver on the VM. For the Linux kernel, this is the `CONFIG_VIRTIO_BALLOON` configuration parameter, active by default in Ubuntu, and enabled by default as a kernel module in Debian, CentOS, and SUSE.


@@ -16,6 +16,7 @@
"""Monasca Agent interface for libvirt metrics"""
import json
import libvirt
import os
import stat
import subprocess
@@ -27,6 +28,14 @@ from distutils.version import LooseVersion
from monasca_agent.collector.checks import AgentCheck
from monasca_agent.collector.virt import inspector

DOM_STATES = {libvirt.VIR_DOMAIN_BLOCKED: 'VM is blocked',
              libvirt.VIR_DOMAIN_CRASHED: 'VM has crashed',
              libvirt.VIR_DOMAIN_NONE: 'VM has no state',
              libvirt.VIR_DOMAIN_PAUSED: 'VM is paused',
              libvirt.VIR_DOMAIN_PMSUSPENDED: 'VM is in power management (s3) suspend',
              libvirt.VIR_DOMAIN_SHUTDOWN: 'VM is shutting down',
              libvirt.VIR_DOMAIN_SHUTOFF: 'VM has been shut off'}


class LibvirtCheck(AgentCheck):
@@ -281,6 +290,32 @@ class LibvirtCheck(AgentCheck):
                              'timestamp': sample_time,
                              'value': value}

    def _inspect_state(self, insp, inst, instance_cache, dims_customer, dims_operations):
        """Look at the state of the instance, publish a metric using a
        user-friendly description in the 'detail' metadata, and return
        a status code (calibrated to UNIX status codes where 0 is OK)
        so that remaining metrics can be skipped if the VM is not OK
        """
        inst_name = inst.name()
        dom_status = inst.state()[0] - 1
        metatag = None

        if inst.state()[0] in DOM_STATES:
            metatag = {'detail': DOM_STATES[inst.state()[0]]}
        # A nova-suspended VM has a SHUTOFF Power State, but alternate Status
        if inst.state() == [libvirt.VIR_DOMAIN_SHUTOFF, 5]:
            metatag = {'detail': 'VM has been suspended'}

        self.gauge('host_alive_status', dom_status, dimensions=dims_customer,
                   delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                   hostname=instance_cache.get(inst_name)['hostname'],
                   value_meta=metatag)
        self.gauge('vm.host_alive_status', dom_status,
                   dimensions=dims_operations,
                   value_meta=metatag)

        return dom_status

    def check(self, instance):
        """Gather VM metrics for each instance"""
@@ -331,16 +366,20 @@ class LibvirtCheck(AgentCheck):
                self.log.error("{0} is not known to nova after instance cache update -- skipping this ghost VM.".format(inst_name))
                continue

            # Skip instances that are inactive
            if inst.isActive() == 0:
                detail = 'Instance is not active'
                self.gauge('host_alive_status', 2, dimensions=dims_customer,
                           delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                           hostname=instance_cache.get(inst_name)['hostname'],
                           value_meta={'detail': detail})
                self.gauge('vm.host_alive_status', 2, dimensions=dims_operations,
                           value_meta={'detail': detail})

            # Accumulate aggregate data
            for gauge in agg_gauges:
                if gauge in instance_cache.get(inst_name):
                    agg_values[gauge] += instance_cache.get(inst_name)[gauge]

            # Skip further processing on VMs that are not in an active state
            if self._inspect_state(insp, inst, instance_cache,
                                   dims_customer, dims_operations) != 0:
                continue

            # Skip the remainder of the checks if alive_only is True in the config
            if self.init_config.get('alive_only'):
                continue

            if inst_name not in metric_cache:
                metric_cache[inst_name] = {}
@@ -351,39 +390,6 @@ class LibvirtCheck(AgentCheck):
                                               vm_probation_remaining))
                continue

            # Test instance's general responsiveness (ping check) if so configured
            if self.init_config.get('ping_check') and 'private_ip' in instance_cache.get(inst_name):
                detail = 'Ping check OK'
                ping_cmd = self.init_config.get('ping_check').split()
                ping_cmd.append(instance_cache.get(inst_name)['private_ip'])
                with open(os.devnull, "w") as fnull:
                    try:
                        res = subprocess.call(ping_cmd,
                                              stdout=fnull,
                                              stderr=fnull)
                        if res > 0:
                            detail = 'Host failed ping check'
                        self.gauge('host_alive_status', res, dimensions=dims_customer,
                                   delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                                   hostname=instance_cache.get(inst_name)['hostname'],
                                   value_meta={'detail': detail})
                        self.gauge('vm.host_alive_status', res, dimensions=dims_operations,
                                   value_meta={'detail': detail})
                        # Do not attempt to process any more metrics for offline hosts
                        if res > 0:
                            continue
                    except OSError as e:
                        self.log.warn("OS error running '{0}' returned {1}".format(ping_cmd, e))

            # Skip the remainder of the checks if ping_only is True in the config
            if self.init_config.get('ping_only'):
                continue

            # Accumulate aggregate data
            for gauge in agg_gauges:
                if gauge in instance_cache.get(inst_name):
                    agg_values[gauge] += instance_cache.get(inst_name)[gauge]

            self._inspect_cpu(insp, inst, instance_cache, metric_cache, dims_customer, dims_operations)
            self._inspect_disks(insp, inst, instance_cache, metric_cache, dims_customer, dims_operations)
            self._inspect_network(insp, inst, instance_cache, metric_cache, dims_customer, dims_operations)
@@ -405,6 +411,25 @@ class LibvirtCheck(AgentCheck):
                except KeyError:
                    self.log.debug("Balloon driver not active/available on guest {0} ({1})".format(inst_name,
                                                                                                   instance_cache.get(inst_name)['hostname']))

            # Test instance's remote responsiveness (ping check) if so configured
            # NOTE: This is only supported for Nova networking at this time.
            if self.init_config.get('ping_check') and 'private_ip' in instance_cache.get(inst_name):
                ping_cmd = self.init_config.get('ping_check').split()
                ping_cmd.append(instance_cache.get(inst_name)['private_ip'])
                with open(os.devnull, "w") as fnull:
                    try:
                        res = subprocess.call(ping_cmd,
                                              stdout=fnull,
                                              stderr=fnull)
                        self.gauge('ping_status', res, dimensions=dims_customer,
                                   delegated_tenant=instance_cache.get(inst_name)['tenant_id'],
                                   hostname=instance_cache.get(inst_name)['hostname'])
                        self.gauge('vm.ping_status', res, dimensions=dims_operations)
                        # Do not attempt to process any more metrics for offline hosts
                        if res > 0:
                            continue
                    except OSError as e:
                        self.log.warn("OS error running '{0}' returned {1}".format(ping_cmd, e))

        # Save these metrics for the next collector invocation
        self._update_metric_cache(metric_cache)