Add SolidFire Monasca Check Plugin:

This adds a very basic plugin that monitors SolidFire storage clusters. The
plugin provides cluster fault and usage metrics. This uses a stripped down
version of the Cinder volume driver to provide connectivity to the cluster.

Change-Id: Ibd0d577b6bf45747a0daaec2c5c3479741ae0937
Author: Chris Morrell, 2016-07-12 21:44:47 +00:00
parent b004344dc5
commit fc214a8596
3 changed files with 312 additions and 0 deletions


@@ -0,0 +1,12 @@
# Copyright (c) 2016 NetApp, Inc.
init_config:
instances:
# Each cluster can be monitored with a separate instance.
- name: rack_d_cluster
# Cluster admin with reporting permissions.
username: monasca_admin
# Cluster admin password.
password: secret_password
# Cluster MVIP address, must be reachable.
mvip: 192.168.1.1


@@ -65,6 +65,7 @@
- [RabbitMQ Checks](#rabbitmq-checks)
- [RedisDB](#redisdb)
- [Riak](#riak)
- [SolidFire](#solidfire)
- [SQLServer](#sqlserver)
- [Supervisord](#supervisord)
- [Swift Diags](#swift-diags)
@@ -156,6 +157,7 @@ The following plugins are delivered via setup as part of the standard plugin checks
| rabbitmq | /root/.rabbitmq.cnf | |
| redisdb | | |
| riak | | |
| solidfire | | Track cluster health and usage stats |
| sqlserver | | |
| supervisord | | |
| swift_diags | | |
@@ -1378,6 +1380,45 @@ See [the example configuration](https://github.com/openstack/monasca-agent/blob/
## Riak
See [the example configuration](https://github.com/openstack/monasca-agent/blob/master/conf.d/riak.yaml.example) for how to configure the Riak plugin.
## SolidFire
The SolidFire checks require a matching solidfire.yaml to be present. Currently the checks report a mixture of cluster utilization and health metrics. Multiple clusters can be monitored via separate instance stanzas in the config file.
Sample config:

```yaml
instances:
  - name: cluster_rack_d
    username: cluster_admin
    password: secret_password
    mvip: 192.168.1.1
```
The SolidFire checks return the following metrics:
| Metric Name | Dimensions | Semantics |
| ----------- | ---------- | --------- |
| solidfire.active_cluster_faults | service=solidfire, cluster | Number of active cluster faults, such as failed drives |
| solidfire.cluster_utilization | service=solidfire, cluster | Overall cluster IOP utilization |
| solidfire.num_iscsi_sessions | service=solidfire, cluster | Number of active iSCSI sessions connected to the cluster |
| solidfire.iops.avg_5_sec | service=solidfire, cluster | Average IOPS over the last 5 seconds |
| solidfire.iops.avg_utc | service=solidfire, cluster | Average IOPS since midnight UTC |
| solidfire.iops.peak_utc | service=solidfire, cluster | Peak IOPS since midnight UTC |
| solidfire.iops.max_available | service=solidfire, cluster | Theoretical maximum IOPS |
| solidfire.active_block_bytes | service=solidfire, cluster | Amount of space consumed by the block services, including cruft |
| solidfire.active_meta_bytes | service=solidfire, cluster | Amount of space consumed by the metadata services |
| solidfire.active_snapshot_bytes | service=solidfire, cluster | Amount of space consumed by the metadata services for snapshots |
| solidfire.provisioned_bytes | service=solidfire, cluster | Total number of provisioned bytes |
| solidfire.unique_blocks_used_bytes | service=solidfire, cluster | Amount of space the unique blocks take on the block drives |
| solidfire.max_block_bytes | service=solidfire, cluster | Maximum amount of bytes allocated to the block services |
| solidfire.max_meta_bytes | service=solidfire, cluster | Maximum amount of bytes allocated to the metadata services |
| solidfire.max_provisioned_bytes | service=solidfire, cluster | Max provisionable space if 100% metadata space used |
| solidfire.max_overprovisioned_bytes | service=solidfire, cluster | Max provisionable space * 5, artificial safety limit |
| solidfire.unique_blocks | service=solidfire, cluster | Number of blocks (not always 4KiB) stored on block drives |
| solidfire.non_zero_blocks | service=solidfire, cluster | Number of 4KiB blocks with data after the last garbage collection |
| solidfire.zero_blocks | service=solidfire, cluster | Number of 4KiB blocks without data after the last garbage collection |
| solidfire.thin_provision_factor | service=solidfire, cluster | Thin provisioning factor, (nonZeroBlocks + zeroBlocks) / nonZeroBlocks |
| solidfire.deduplication_factor | service=solidfire, cluster | Data deduplication factor, nonZeroBlocks / uniqueBlocks |
| solidfire.compression_factor | service=solidfire, cluster | Data compression factor, (uniqueBlocks * 4096) / uniqueBlocksUsedSpace |
| solidfire.data_reduction_factor | service=solidfire, cluster | Aggregate data reduction efficiency, thin_prov * dedup * compression |
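The four derived factors in the table follow directly from the raw block counters. A minimal sketch of the same arithmetic, using made-up counter values (the function name and inputs here are illustrative, not part of the plugin's API):

```python
def reduction_factors(non_zero_blocks, zero_blocks, unique_blocks,
                      unique_blocks_used_space):
    """Reproduce the efficiency calculations from the metrics table."""
    # Thin provisioning: total addressed blocks vs. blocks holding data.
    thin = (non_zero_blocks + zero_blocks) / float(non_zero_blocks)
    # Deduplication: data blocks vs. unique blocks actually stored.
    dedup = non_zero_blocks / float(unique_blocks)
    # Compression: pre-compression size (4KiB blocks) vs. bytes on disk.
    comp = (unique_blocks * 4096) / float(unique_blocks_used_space)
    # Overall data reduction is the product of the three.
    return thin, dedup, comp, thin * dedup * comp

# Illustrative numbers: 1M data blocks, 3M zero blocks,
# 500k unique blocks compressed into 1 GiB on disk.
thin, dedup, comp, total = reduction_factors(1000000, 3000000, 500000,
                                             1024 ** 3)
# thin=4.0, dedup=2.0, comp~1.907, total~15.26
```

Note that a cluster full of unwritten (zero) blocks reports a very high thin-provisioning factor, so the aggregate `data_reduction_factor` is most meaningful on clusters with real workloads.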
## SQLServer
See [the example configuration](https://github.com/openstack/monasca-agent/blob/master/conf.d/sqlserver.yaml.example) for how to configure the SQLServer plugin.


@@ -0,0 +1,259 @@
import json
import logging
import time
import warnings

import requests
from requests.packages.urllib3 import exceptions
import six

import monasca_agent.collector.checks as checks

LOG = logging.getLogger(__name__)
class SolidFire(checks.AgentCheck):
    """SolidFire plugin for reporting cluster metrics.

    Reference the general plugin documentation for metric specifics.
    """

    def __init__(self, name, init_config, agent_config):
        super(SolidFire, self).__init__(name, init_config, agent_config)
        self.sf = None
        self.instance = None
        self.cluster = None

    def check(self, instance):
        """Pull down cluster stats."""
        self.cluster = instance.get('name')
        dimensions = {'service': 'solidfire',
                      'cluster': self.cluster}
        data = {}
        num_metrics = 0

        # Extract cluster auth information
        auth = self._pull_auth(instance)
        self.sf = SolidFireLib(auth)

        # Query cluster for stats
        data.update(self._get_cluster_stats())

        # Query for active cluster faults.
        data.update(self._list_cluster_faults())

        # Query for cluster capacity info
        data.update(self._get_cluster_capacity())

        # Dump data upstream.
        for key, value in six.iteritems(data):
            if value is None:
                continue
            self.gauge(key, value, dimensions)
            num_metrics += 1
        LOG.debug('Collected %s metrics', num_metrics)

    def _pull_auth(self, instance):
        """Extract auth data from instance data.

        Simple check to verify we have enough auth information to connect
        to the SolidFire cluster.
        """
        for k in ['mvip', 'username', 'password']:
            if k not in instance:
                msg = 'Missing config value: %s' % k
                LOG.error(msg)
                raise Exception(msg)
        auth = {'mvip': instance.get('mvip'),
                'port': instance.get('port', 443),
                'login': instance.get('username'),
                'passwd': instance.get('password')}
        auth['url'] = 'https://%s:%s' % (auth['mvip'], auth['port'])
        return auth
    def _get_cluster_stats(self):
        res = (self.sf.issue_api_request('GetClusterStats', {}, '8.0')
               ['result']['clusterStats'])

        # Cluster utilization is the overall load.
        data = {'solidfire.cluster_utilization': res['clusterUtilization']}
        return data

    def _get_cluster_capacity(self):
        res = (self.sf.issue_api_request('GetClusterCapacity', {}, '8.0')
               ['result']['clusterCapacity'])

        # Number of 4KiB blocks with data after the last garbage collection
        non_zero_blocks = res['nonZeroBlocks']
        # Number of 4KiB blocks without data after the last garbage collection
        zero_blocks = res['zeroBlocks']
        # Number of blocks (not always 4KiB) stored on block drives.
        unique_blocks = res['uniqueBlocks']
        # Amount of space the unique blocks take on the block drives.
        unique_blocks_space = res['uniqueBlocksUsedSpace']
        # Amount of space consumed by the block services, including cruft.
        active_block_space = res['activeBlockSpace']
        # Maximum amount of bytes allocated to the block services.
        max_block_space = res['maxUsedSpace']
        # Amount of space consumed by the metadata services.
        active_slice_space = res['usedMetadataSpace']
        # Amount of space consumed by the metadata services for snapshots.
        active_snap_space = res['usedMetadataSpaceInSnapshots']
        # Maximum amount of bytes allocated to the metadata services.
        max_slice_space = res['maxUsedMetadataSpace']
        # Volume provisioned space.
        prov_space = res['provisionedSpace']
        # Max provisionable space if 100% metadata space used.
        max_prov_space = res['maxProvisionedSpace']
        # Overprovision limit.
        max_overprov_space = res['maxOverProvisionableSpace']
        # Number of active iSCSI sessions.
        iscsi_sessions = res['activeSessions']
        # Average IOPS since midnight UTC.
        avg_iops = res['averageIOPS']
        # Peak IOPS since midnight UTC.
        peak_iops = res['peakIOPS']
        # Current IOPS over the last 5 seconds.
        current_iops = res['currentIOPS']
        # Theoretical max IOPS.
        max_iops = res['maxIOPS']

        # Single-node clusters can report zero values for some divisors.
        thin_factor, dedup_factor, comp_factor = 1, 1, 1

        # Same calculations used in the SolidFire UI.
        if non_zero_blocks:
            # Thin provisioning factor
            thin_factor = ((non_zero_blocks + zero_blocks) /
                           float(non_zero_blocks))
        if unique_blocks:
            # Data deduplication factor
            dedup_factor = non_zero_blocks / float(unique_blocks)
        if unique_blocks_space:
            # 4096 constant from our internal block size, pre-compression
            # Compression efficiency factor
            comp_factor = (unique_blocks * 4096) / float(unique_blocks_space)
        # Overall data reduction efficiency factor
        eff_factor = thin_factor * dedup_factor * comp_factor

        data = {'solidfire.num_iscsi_sessions': iscsi_sessions,
                'solidfire.iops.avg_utc': avg_iops,
                'solidfire.iops.peak_utc': peak_iops,
                'solidfire.iops.avg_5_sec': current_iops,
                'solidfire.iops.max_available': max_iops,
                'solidfire.provisioned_bytes': prov_space,
                'solidfire.max_provisioned_bytes': max_prov_space,
                'solidfire.max_overprovisioned_bytes': max_overprov_space,
                'solidfire.max_block_bytes': max_block_space,
                'solidfire.active_block_bytes': active_block_space,
                'solidfire.max_meta_bytes': max_slice_space,
                'solidfire.active_meta_bytes': active_slice_space,
                'solidfire.active_snapshot_bytes': active_snap_space,
                'solidfire.non_zero_blocks': non_zero_blocks,
                'solidfire.zero_blocks': zero_blocks,
                'solidfire.unique_blocks': unique_blocks,
                'solidfire.unique_blocks_used_bytes': unique_blocks_space,
                'solidfire.thin_provision_factor': thin_factor,
                'solidfire.deduplication_factor': dedup_factor,
                'solidfire.compression_factor': comp_factor,
                'solidfire.data_reduction_factor': eff_factor}
        return data

    def _list_cluster_faults(self):
        # Report the number of active faults. Might be useful for an alarm?
        res = (self.sf.issue_api_request('ListClusterFaults',
                                         {'faultTypes': 'current'},
                                         '8.0')
               ['result']['faults'])
        data = {'solidfire.active_cluster_faults': len(res)}
        return data
def retry(exc_tuple, tries=5, delay=1, backoff=2):
    # Retry decorator used for issuing API requests.
    def retry_dec(f):
        @six.wraps(f)
        def func_retry(*args, **kwargs):
            _tries, _delay = tries, delay
            while _tries > 1:
                try:
                    return f(*args, **kwargs)
                except exc_tuple:
                    time.sleep(_delay)
                    _tries -= 1
                    _delay *= backoff
                    LOG.debug('Retrying %(args)s, %(tries)s attempts '
                              'remaining...',
                              {'args': args, 'tries': _tries})
            msg = 'Retry count exceeded for command: %s' % args[1]
            LOG.error(msg)
            raise Exception(msg)
        return func_retry
    return retry_dec
class SolidFireLib(object):
    """Gutted version of the Cinder driver.

    Just enough to communicate with a SolidFire cluster for POC.
    """
    retryable_errors = ['xDBVersionMismatch',
                        'xMaxSnapshotsPerVolumeExceeded',
                        'xMaxClonesPerVolumeExceeded',
                        'xMaxSnapshotsPerNodeExceeded',
                        'xMaxClonesPerNodeExceeded',
                        'xNotReadyForIO']
    # Trailing comma required; without it this is not a tuple.
    retry_exc_tuple = (requests.exceptions.ConnectionError,)

    def __init__(self, auth):
        self.endpoint = auth
        self.active_cluster_info = {}
        self._set_active_cluster_info(auth)

    @retry(retry_exc_tuple, tries=6)
    def issue_api_request(self, method, params, version='1.0', endpoint=None):
        if params is None:
            params = {}
        if endpoint is None:
            endpoint = self.active_cluster_info['endpoint']
        payload = {'method': method, 'params': params}
        url = '%s/json-rpc/%s/' % (endpoint['url'], version)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", exceptions.InsecureRequestWarning)
            req = requests.post(url,
                                data=json.dumps(payload),
                                auth=(endpoint['login'], endpoint['passwd']),
                                verify=False,
                                timeout=30)
            response = req.json()
            req.close()
        if (('error' in response) and
                (response['error']['name'] in self.retryable_errors)):
            msg = ('Retryable error (%s) encountered during '
                   'SolidFire API call.' % response['error']['name'])
            raise Exception(msg)
        if 'error' in response:
            msg = 'API response: %s' % response
            raise Exception(msg)
        return response

    def _set_active_cluster_info(self, endpoint):
        self.active_cluster_info['endpoint'] = endpoint
        for k, v in self.issue_api_request(
                'GetClusterInfo',
                {})['result']['clusterInfo'].items():
            self.active_cluster_info[k] = v

        # Add a couple extra things that are handy for us
        self.active_cluster_info['clusterAPIVersion'] = (
            self.issue_api_request('GetClusterVersionInfo',
                                   {})['result']['clusterAPIVersion'])