StackLight 0.10.0 documentation updates

Change-Id: I13d1de2c2984c09e79c68301a46d35479f9afb21
Patrick Petit 2016-07-06 19:22:12 +02:00
parent 4cac7500fb
commit 0ef08aaa4a
7 changed files with 1587 additions and 421 deletions

File diff suppressed because it is too large


@@ -0,0 +1,736 @@
.. _alarm_list:
Appendix C: List of built-in alarms
===================================
Here is the list of all the alarms that are built into StackLight::
alarms:
- name: 'cpu-critical-controller'
description: 'The CPU usage is too high (controller node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- metric: cpu_wait
relational_operator: '>='
threshold: 35
window: 120
periods: 0
function: avg
- name: 'cpu-warning-controller'
description: 'The CPU usage is high (controller node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- metric: cpu_wait
relational_operator: '>='
threshold: 25
window: 120
periods: 0
function: avg
- name: 'cpu-critical-compute'
description: 'The CPU usage is too high (compute node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 30
window: 120
periods: 0
function: avg
- name: 'cpu-warning-compute'
description: 'The CPU usage is high (compute node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 20
window: 120
periods: 0
function: avg
- name: 'cpu-critical-rabbitmq'
description: 'The CPU usage is too high (RabbitMQ node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-rabbitmq'
description: 'The CPU usage is high (RabbitMQ node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-mysql'
description: 'The CPU usage is too high (MySQL node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-mysql'
description: 'The CPU usage is high (MySQL node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-storage'
description: 'The CPU usage is too high (storage node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 40
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-storage'
description: 'The CPU usage is high (storage node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 30
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-default'
description: 'The CPU usage is too high'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 35
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'rabbitmq-disk-limit-critical'
description: 'RabbitMQ has reached the free disk threshold. All producers are blocked'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_disk
relational_operator: '<='
threshold: 0
window: 20
periods: 0
function: min
- name: 'rabbitmq-disk-limit-warning'
description: 'RabbitMQ is getting close to the free disk threshold'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_disk
relational_operator: '<='
threshold: 104857600 # 100MB
window: 20
periods: 0
function: min
- name: 'rabbitmq-memory-limit-critical'
description: 'RabbitMQ has reached the memory threshold. All producers are blocked'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_memory
relational_operator: '<='
threshold: 0
window: 20
periods: 0
function: min
- name: 'rabbitmq-memory-limit-warning'
description: 'RabbitMQ is getting close to the memory threshold'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_memory
relational_operator: '<='
threshold: 104857600 # 100MB
window: 20
periods: 0
function: min
- name: 'rabbitmq-queue-warning'
description: 'The number of outstanding messages is too high'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_messages
relational_operator: '>='
threshold: 200
window: 120
periods: 0
function: avg
- name: 'apache-warning'
description: 'There are no Apache idle workers available'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: apache_idle_workers
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: min
- name: 'log-fs-warning'
description: "The log filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/log'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'log-fs-critical'
description: "The log filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/log'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'root-fs-warning'
description: "The root filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'root-fs-critical'
description: "The root filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/'
relational_operator: '<'
threshold: 2
window: 60
periods: 0
function: min
- name: 'mysql-fs-warning'
description: "The MySQL filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/mysql'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'mysql-fs-critical'
description: "The MySQL filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/mysql'
relational_operator: '<'
threshold: 2
window: 60
periods: 0
function: min
- name: 'nova-fs-warning'
description: "The filesystem's free space is low (compute node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/nova'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'nova-fs-critical'
description: "The filesystem's free space is too low (compute node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/nova'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'nova-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on nova-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'nova-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'nova-logs-error'
description: 'Too many errors have been detected in Nova logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'nova'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'heat-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on heat-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'heat-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'heat-logs-error'
description: 'Too many errors have been detected in Heat logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'heat'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'swift-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on swift-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'swift-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'cinder-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on cinder-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'cinder-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'cinder-logs-error'
description: 'Too many errors have been detected in Cinder logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'cinder'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'glance-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on glance-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'glance-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'glance-logs-error'
description: 'Too many errors have been detected in Glance logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'glance'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'neutron-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on neutron-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'neutron-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'neutron-logs-error'
description: 'Too many errors have been detected in Neutron logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'neutron'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'keystone-public-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on keystone-public-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'keystone-public-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'keystone-admin-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on keystone-admin-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'keystone-admin-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'keystone-logs-error'
description: 'Too many errors have been detected in Keystone logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'keystone'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'mysql-node-connected'
description: 'The MySQL service has lost connectivity with the other nodes'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: mysql_cluster_connected
relational_operator: '=='
threshold: 0
window: 30
periods: 1
function: min
- name: 'mysql-node-ready'
description: "The MySQL service isn't ready to serve queries"
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: mysql_cluster_ready
relational_operator: '=='
threshold: 0
window: 30
periods: 1
function: min
- name: 'ceph-health-critical'
description: 'Ceph health is critical'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: ceph_health
relational_operator: '=='
threshold: 3 # HEALTH_ERR
window: 60
function: max
- name: 'ceph-health-warning'
description: 'Ceph health is warning'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: ceph_health
relational_operator: '=='
threshold: 2 # HEALTH_WARN
window: 60
function: max
- name: 'ceph-capacity-critical'
description: 'Ceph free capacity is too low'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: ceph_pool_total_percent_free
relational_operator: '<'
threshold: 2
window: 60
function: max
- name: 'ceph-capacity-warning'
description: 'Ceph free capacity is low'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: ceph_pool_total_percent_free
relational_operator: '<'
threshold: 5
window: 60
function: max
- name: 'elasticsearch-health-critical'
description: 'Elasticsearch cluster health is critical'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_cluster_health
relational_operator: '=='
threshold: 3 # red
window: 60
function: min
- name: 'elasticsearch-health-warning'
description: 'Elasticsearch health is warning'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_cluster_health
relational_operator: '=='
threshold: 2 # yellow
window: 60
function: min
- name: 'elasticsearch-fs-warning'
description: "The filesystem's free space is low (Elasticsearch node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
relational_operator: '<'
threshold: 20
window: 60
periods: 0
function: min
- name: 'elasticsearch-fs-critical'
description: "The filesystem's free space is too low (Elasticsearch node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
relational_operator: '<'
threshold: 15
window: 60
periods: 0
function: min
- name: 'influxdb-fs-warning'
description: "The filesystem's free space is low (InfluxDB node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/influxdb'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'influxdb-fs-critical'
description: "The filesystem's free space is too low (InfluxDB node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/influxdb'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
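To illustrate how these definitions are interpreted, here is a minimal sketch in
Python of the evaluation of a single trigger. It is only a conceptual model of the
rule semantics (the actual evaluation is performed by the collector's Lua plugins),
and ``datapoints`` stands for a hypothetical list of metric values observed during
the rule's ``window``::

    import operator

    RELATIONAL = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
                  '<=': operator.le, '==': operator.eq}

    FUNCTIONS = {
        'avg': lambda values: sum(values) / float(len(values)),
        'min': min,
        'max': max,
        # 'diff' is modeled here as the change between the first and last values.
        'diff': lambda values: values[-1] - values[0],
    }

    def rule_matches(rule, datapoints):
        """Return True when the rule fires for the values observed in its window."""
        value = FUNCTIONS[rule['function']](datapoints)
        return RELATIONAL[rule['relational_operator']](value, rule['threshold'])

    def trigger_fires(trigger, datapoints_per_rule):
        """Combine the rule results with the trigger's logical operator."""
        results = [rule_matches(rule, points)
                   for rule, points in zip(trigger['rules'], datapoints_per_rule)]
        combine = any if trigger.get('logical_operator', 'or') == 'or' else all
        return combine(results)

    # 'cpu-critical-controller' fires, among others, when the average of
    # cpu_idle over the 120-second window is lower than or equal to 5 percent.
    rule = {'metric': 'cpu_idle', 'relational_operator': '<=', 'threshold': 5,
            'window': 120, 'periods': 0, 'function': 'avg'}
    print(rule_matches(rule, [3.2, 4.1, 2.9]))  # True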


@@ -43,7 +43,7 @@ source_suffix = '.rst'
master_doc = 'index'
# General information about the project.
project = u'The LMA Collector Plugin for Fuel'
project = u'The StackLight Collector Plugin for Fuel'
copyright = u'2015, Mirantis Inc.'
# The version info for the project you're documenting, acts as replacement for
@@ -198,7 +198,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'LMAcollector.tex', u'The LMA Collector Plugin for Fuel Documentation',
('index', 'LMAcollector.tex', u'The StackLight Collector Plugin for Fuel Documentation',
u'Mirantis Inc.', 'manual'),
]
@@ -228,7 +228,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'lmacollector', u'The LMA Collector Plugin for Fuel Documentation',
('index', 'lmacollector', u'The StackLight Collector Plugin for Fuel Documentation',
[u'Mirantis Inc.'], 1)
]
@@ -242,7 +242,7 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'LMAcollector', u'The LMA Collector Plugin for Fuel Documentation',
('index', 'LMAcollector', u'The StackLight Collector Plugin for Fuel Documentation',
u'Mirantis Inc.', 'LMAcollector', 'One line description of project.',
'Miscellaneous'),
]


@@ -126,3 +126,65 @@ use the instructions below to troubleshoot the problem:
5. Check if the nodes are able to connect to the Elasticsearch server on port 9200.
6. Check if the nodes are able to connect to the InfluxDB server on port 8086.
.. _diagnostic:
Diagnostic Tool
---------------
A **global diagnostic tool** is installed on the Fuel Master node
by the StackLight Collector Plugin. The global diagnostic tool checks
that StackLight is configured and running properly across the entire
LMA toolchain for all the nodes that are ready in your OpenStack environment::
[root@nailgun ~]# /var/www/nailgun/plugins/lma_collector-<version>/contrib/tools/diagnostic.sh
Running lma_diagnostic tool on all available nodes (this can take several minutes)
The diagnostic archive is here: /var/lma_diagnostics.2016-06-10_11-23-1465557820.tgz
Note that a global diagnostic can take several minutes.
All the results are consolidated in an archive file with the
name ``/var/lma_diagnostics.[date +%Y-%m-%d_%H-%M-%s].tgz``.
Instead of running a global diagnostic, you may want to run the diagnostic
on individual nodes. The tool determines which checks should be executed
based on the role of the node, as shown below::
root@node-3:~# hiera roles
["controller"]
root@node-3:~# lma_diagnostics
2016-06-10-11-08-04 INFO node-3.test.domain.local role ["controller"]
2016-06-10-11-08-04 INFO ** LMA Collector
2016-06-10-11-08-04 INFO 2 process(es) 'hekad -config' found
2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 4352
2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 8325
2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 5567
2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 4353
[...]
In the example above, the diagnostic tool reports that two *hekad*
processes are running on *node-3*, which is the expected outcome.
In the case where one *hekad* process is not running, the
diagnostic tool would report an error as shown below::
root@node-3:~# lma_diagnostics
2016-06-10-11-11-48 INFO node-3.test.domain.local role ["controller"]
2016-06-10-11-11-48 INFO ** LMA Collector
2016-06-10-11-11-48 **ERROR 1 'hekad -config' processes found, 2 expected!**
2016-06-10-11-11-48 **ERROR 'hekad' process does not LISTEN on port: 4352**
[...]
Here, two errors are reported:
1. There is only one *hekad* process running instead of two.
2. No *hekad* process is listening on port 4352.
This is just one example of the many checks performed by the
diagnostic tool.
On the OpenStack nodes, the diagnostic results are stored
in ``/var/lma_diagnostics/diagnostics.log``.
**A successful LMA toolchain diagnostic should be free of errors**.
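To check a node's diagnostic result from a script, a minimal sketch like the
following can be used. It only assumes the ``/var/lma_diagnostics/diagnostics.log``
location mentioned above and the ``ERROR`` marker shown in the output examples::

    import sys

    LOG_FILE = '/var/lma_diagnostics/diagnostics.log'

    # Print every line reporting an error and exit with a non-zero status
    # when at least one error was found.
    with open(LOG_FILE) as log:
        errors = [line.rstrip() for line in log if 'ERROR' in line]

    for line in errors:
        print(line)
    sys.exit(1 if errors else 0)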


@@ -13,6 +13,7 @@ Welcome to the StackLight Collector Documentation!
licenses
appendix_a
appendix_b
appendix_c
Indices and Tables
==================


@@ -3,76 +3,84 @@
Overview
========
The LMA Collector is the advanced monitoring agent of the
so-called Logging, Monitoring and Alerting (LMA) Toolchain of Mirantis OpenStack,
which is now officially called the **StackLight Collector** (or just the *collector*).
The **StackLight Collector Plugin** is used to install and configure
several software components that are used to collect and process all the
data that we think is relevant to provide deep operational insights about
your OpenStack environment. These finely integrated components are
collectively referred to as the **StackLight Collector** (or just **the Collector**).
The StackLight Collector should be installed on each of the OpenStack nodes you
want to monitor. It is a key component of the
`LMA Toolchain of Mirantis OpenStack <https://launchpad.net/lma-toolchain>`_
as shown in the figure below:
.. note:: The Collector has evolved over time and so the term
'collector' is a little bit of a misnomer since it is
more of a **smart monitoring agent** than a mere data 'collector'.
The Collector is a key component of the so-called
`Logging, Monitoring and Alerting toolchain of Mirantis OpenStack
<https://launchpad.net/lma-toolchain>`_ (a.k.a StackLight).
.. image:: ../../images/toolchain_map.png
:align: center
Each *collector* is individually responsible for supporting the sensing,
measurement, collection, analysis and alarm functions for the node
it is running on.
The Collector is installed on every node of your OpenStack
environment. Each Collector is individually responsible for supporting
all the monitoring functions of your OpenStack environment for both
the operating system and the services running on the node.
Note also that the Collector running on the *primary controller*
(the controller which owns the management VIP) is called the
**Aggregator** since it performs additional aggregation and correlation
functions. The Aggregator is the central point of convergence for
all the faults and anomalies detected at the node level. The
fundamental role of the Aggregator is to issue an opinion about the
health status of your OpenStack environment at the cluster
level. As such, the Collector may be viewed as a monitoring
agent for cloud infrastructure clusters.
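The following sketch gives a simplified idea of that cluster-level reasoning. It is
not the Aggregator's actual correlation logic; the status values and the
'worst status wins' policy shown here are illustrative assumptions only::

    # Hypothetical status values, ordered from healthiest to most severe.
    SEVERITY_ORDER = ['okay', 'warning', 'critical', 'down']

    def cluster_status(member_statuses):
        """Derive a cluster-level status from the statuses of its members."""
        # A simple 'worst status wins' policy.
        return max(member_statuses, key=SEVERITY_ORDER.index)

    print(cluster_status(['okay', 'warning', 'okay']))  # 'warning'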
A wealth of operational data is collected from a variety of sources including
log files, collectd and RabbitMQ for the OpenStack notifications.
The main building blocks of the Collector are:
.. note:: The *collector* which runs on the active controller of the control
plane cluster is called the *aggregator* because it performs additional
aggregation and multivariate correlation functions to compute service
health metrics at the cluster level.
* **collectd** which comes bundled with a collection of monitoring plugins.
Some of them are standard collectd plugins while others are purpose-built
plugins written in Python to perform various OpenStack service checks.
* **Heka**, `a golang data processing swiss army knife by Mozilla
<https://github.com/mozilla-services/heka>`_.
Heka supports a number of standard input and output plugins
that allow it to ingest data from a variety of sources,
including collectd, log files and RabbitMQ,
as well as to persist the operational data to external backend servers like
Elasticsearch, InfluxDB and Nagios for search and further processing.
* **A collection of Heka plugins** written in Lua which does
the actual data processing, such as metrics transformations,
alarm evaluation and log parsing.
A primary function of the *collector* is to sanitise and transform the ingested
raw operational data into internal message representations using the
`Heka message structure <https://hekad.readthedocs.io/en/stable/message/index.html>`_.
This message structure is used within the *collector's* plugin framework to match,
filter and route messages to plugins written in `Lua <http://www.lua.org/>`_
that perform various data analysis and computation functions.
.. note:: An important function of the Collector is to normalize
the operational data into an internal `Heka message structure
<https://hekad.readthedocs.io/en/stable/message/index.html>`_
representation that can be ingested into Heka's stream processing
pipeline. The stream processing pipeline uses matching policies to
route the Heka messages to the `Lua <http://www.lua.org/>`_ plugins that
will perform the actual data computation functions.
As such, the *collector* may also be described as a pluggable framework
for operational data stream processing and routing.
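For illustration, the snippet below shows, as a Python dictionary, roughly what such
a normalized message could look like for a log record. The top-level attributes come
from the Heka message structure; the concrete ``Type`` and ``Fields`` values are
simplified assumptions, not the exact ones produced by the collector's plugins::

    # A simplified, hypothetical representation of a normalized log message.
    normalized_message = {
        'Timestamp': 1467806532000000000,        # nanoseconds since the Unix epoch
        'Logger': 'openstack.nova',              # the source of the data
        'Type': 'log',                           # used to match and route the message
        'Severity': 3,                           # syslog-like severity (3 == error)
        'Hostname': 'node-3.test.domain.local',
        'Payload': 'Unexpected error while running command.',
        'Fields': {                              # free-form metadata added by the decoders
            'programname': 'nova-api',
            'severity_label': 'ERROR',
        },
    }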
There are three types of Lua plugins that were developed for the Collector:
Its main building blocks are:
* The **decoder plugins** to sanitize and normalize the ingested data.
* The **filter plugins** to process the data.
* The **encoder plugins** to serialize the data that is
sent to the backend servers.
* `collectd <https://collectd.org/>`_ which is bundled with a collection of
monitoring plugins. Many of them are purpose-built for OpenStack.
* `Heka <https://github.com/mozilla-services/heka>`_ (a golang data processing
*swiss army knife* by Mozilla) which is the cornerstone technology of the Collector.
Heka supports out of the box a number of input and output plugins that allow
the Collector to integrate with a number of external systems' native
protocols like Elasticsearch, InfluxDB, Nagios, SMTP, Whisper, Kafka, AMQP and
Carbon, to name a few.
* A collection of Heka plugins written in Lua to decode, process and encode the
operational data.
There are five types of data sent by the Collector (and the Aggregator)
to the backend servers:
There are three types of Lua plugins running in the *collector*:
* The logs and the notifications, which are referred to as events,
sent to Elasticsearch for indexing.
* The metrics' time series sent to InfluxDB.
* The annotations sent to InfluxDB.
* The OpenStack environment clusters' health status
sent as *passive checks* to Nagios.
* The input plugins which collect, sanitize and transform the raw
data into an internal message representation which is injected into the
Heka pipeline for further processing.
* The filter plugins which execute the analysis and correlation functions.
* The output plugins which encode and transmit the messages to external
systems like Elasticsearch, InfluxDB or Nagios where the data can
be further processed and persisted.
The output of the *collector* and *aggregator* is of four kinds:
* The logs and notifications which are sent to Elasticsearch for indexing.
Elasticsearch combined with Kibana provides insightful log analytics.
* The metrics which are sent to InfluxDB.
InfluxDB combined with Grafana provides insightful time-series analytics.
* The health status metrics for the OpenStack clusters which are sent to Nagios
(or via SMTP) for alerting and escalation purposes.
* The annotation messages which are sent to InfluxDB. The annotation messages contain
information about what caused a service cluster or node cluster to change state
and provide root-cause analysis hints whenever possible. They are also used to
construct the alert notifications that are sent via SMTP or to Nagios.
.. note:: The annotations are like notification messages
which are exposed in Grafana. They contain information about the
anomalies and faults that have been detected by the Collector.
They basically contain the same information as the *passive checks*
sent to Nagios. In addition, they may contain 'hints' about what
the Collector thinks could be the root cause of a problem.
.. _plugin_requirements:
@@ -94,16 +102,9 @@ Requirements
Limitations
-----------
* The plugin is not compatible with an OpenStack environment deployed with Nova-Network.
* The Elasticsearch output plugin of the *collector* is configured to use the **drop** policy
which implies that the *collector* will start dropping the logs and the OpenStack
notifications when the output plugin has reached a buffering limit that is currently
set to 1GB by default. This situation can typically happen when the Elasticsearch server
has been inaccessible for a long period of time.
This limitation may be addressed in a future release of the StackLight Collector Plugin.
* The plugin is not compatible with an OpenStack environment deployed with nova-network.
* When you re-execute tasks on deployed nodes using the Fuel CLI, the *hekad* and
*collectd* services will be restarted on these nodes during the post-deployment
*collectd* processes will be restarted on these nodes during the post-deployment
phase. See `bug #1570850
<https://bugs.launchpad.net/lma-toolchain/+bug/1570850>`_ for details.


@@ -12,27 +12,27 @@ Version 0.10.0
Prior to StackLight version 0.10.0, there was one instance of the *hekad*
process running to process both the logs and the metrics. Starting with StackLight
version 0.10.0, the processing of logs and notifications is separated
from the processing of metrics into two different *hekad* instances.
This allows for better performance and flow control mechanisms when the
version 0.10.0, the processing of the logs and notifications is separated
from the processing of the metrics in two different *hekad* instances.
This allows for better performance and control of the flow when the
maximum buffer size on disk has reached a limit. With the *hekad* instance
processing the metrics, the buffering policy is to drop the metrics
when the maximum buffer size is reached. With the *hekad* instance
processing the logs, the buffering policy is to block the
entire processing pipeline. This way, one can avoid
losing logs (and notifications) in cases when the Elasticsearch
server has been inaccessible for a long period of time.
As a result, the StackLight collector now has two services running
on a node:
entire processing pipeline. This way, we can avoid
losing logs (and notifications) when the Elasticsearch
server is inaccessible for a long period of time.
As a result, the StackLight collector now has two processes running
on the node:
* The **log_collector** service
* The **metric_collector** service
* One for the *log_collector* service
* One for the *metric_collector* service
* Metrics derived from logs are now aggregated
* Metrics derived from logs are aggregated by the *log_collector* service.
To avoid flooding the *metric_collector* with bursts of metrics derived
from logs, the *log_collector* service sends aggregated metrics
by bulk to the *metric_collector* service.
from logs, the *log_collector* service sends metrics by bulk to the
*metric_collector* service.
An example of an aggregated metric derived from logs is the
`openstack_<service>_http_response_time_stats
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/appendix_b.html#api-response-times>`_.
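The sketch below illustrates, in simplified Python rather than the actual Lua filter,
the kind of aggregation performed by the *log_collector*: individual response times
extracted from the logs are accumulated for an interval and only the computed
statistics are flushed in bulk. The statistics fields shown are illustrative
assumptions::

    from collections import defaultdict

    class HttpResponseTimeAggregator(object):
        """Accumulate per-service response times and flush them as one bulk update."""

        def __init__(self):
            self.samples = defaultdict(list)

        def add(self, service, response_time):
            # Called for every parsed HTTP log record.
            self.samples[service].append(response_time)

        def flush(self):
            # Called periodically: emit one aggregated metric per service
            # instead of one metric per log line.
            bulk = []
            for service, values in sorted(self.samples.items()):
                bulk.append({
                    'name': 'openstack_%s_http_response_time_stats' % service,
                    'min': min(values),
                    'max': max(values),
                    'count': len(values),
                })
            self.samples.clear()
            return bulk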
@@ -41,10 +41,10 @@ Version 0.10.0
A diagnostic tool is now available to help diagnose problems.
The diagnostic tool checks that the toolchain is properly installed
and configured across the entire StackLight LMA toolchain. Please check the
the `Troubleshooting Chapter
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
of the User Guide for more information.
and configured across the entire LMA toolchain. Please check the
`Diagnostic Tool
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#diagnostic>`_
section of the User Guide for more information.
* Bug fixes