StackLight 0.10.0 documentation updates

Change-Id: I13d1de2c2984c09e79c68301a46d35479f9afb21
Patrick Petit 2016-07-06 19:22:12 +02:00
parent 4cac7500fb
commit 0ef08aaa4a
7 changed files with 1587 additions and 421 deletions

File diff suppressed because it is too large


@@ -0,0 +1,736 @@
.. _alarm_list:
Appendix C: List of built-in alarms
===================================
Here is the list of all the alarms that are built into StackLight::
alarms:
- name: 'cpu-critical-controller'
description: 'The CPU usage is too high (controller node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- metric: cpu_wait
relational_operator: '>='
threshold: 35
window: 120
periods: 0
function: avg
- name: 'cpu-warning-controller'
description: 'The CPU usage is high (controller node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- metric: cpu_wait
relational_operator: '>='
threshold: 25
window: 120
periods: 0
function: avg
- name: 'cpu-critical-compute'
description: 'The CPU usage is too high (compute node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 30
window: 120
periods: 0
function: avg
- name: 'cpu-warning-compute'
description: 'The CPU usage is high (compute node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 20
window: 120
periods: 0
function: avg
- name: 'cpu-critical-rabbitmq'
description: 'The CPU usage is too high (RabbitMQ node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-rabbitmq'
description: 'The CPU usage is high (RabbitMQ node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-mysql'
description: 'The CPU usage is too high (MySQL node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-mysql'
description: 'The CPU usage is high (MySQL node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-storage'
description: 'The CPU usage is too high (storage node)'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 40
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'cpu-warning-storage'
description: 'The CPU usage is high (storage node)'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 30
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 15
window: 120
periods: 0
function: avg
- name: 'cpu-critical-default'
description: 'The CPU usage is too high'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: cpu_wait
relational_operator: '>='
threshold: 35
window: 120
periods: 0
function: avg
- metric: cpu_idle
relational_operator: '<='
threshold: 5
window: 120
periods: 0
function: avg
- name: 'rabbitmq-disk-limit-critical'
description: 'RabbitMQ has reached the free disk threshold. All producers are blocked'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_disk
relational_operator: '<='
threshold: 0
window: 20
periods: 0
function: min
- name: 'rabbitmq-disk-limit-warning'
description: 'RabbitMQ is getting close to the free disk threshold'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_disk
relational_operator: '<='
threshold: 104857600 # 100MB
window: 20
periods: 0
function: min
- name: 'rabbitmq-memory-limit-critical'
description: 'RabbitMQ has reached the memory threshold. All producers are blocked'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_memory
relational_operator: '<='
threshold: 0
window: 20
periods: 0
function: min
- name: 'rabbitmq-memory-limit-warning'
description: 'RabbitMQ is getting close to the memory threshold'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_remaining_memory
relational_operator: '<='
threshold: 104857600 # 100MB
window: 20
periods: 0
function: min
- name: 'rabbitmq-queue-warning'
description: 'The number of outstanding messages is too high'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: rabbitmq_messages
relational_operator: '>='
threshold: 200
window: 120
periods: 0
function: avg
- name: 'apache-warning'
description: 'There are no Apache idle workers available'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: apache_idle_workers
relational_operator: '=='
threshold: 0
window: 60
periods: 0
function: min
- name: 'log-fs-warning'
description: "The log filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/log'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'log-fs-critical'
description: "The log filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/log'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'root-fs-warning'
description: "The root filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'root-fs-critical'
description: "The root filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/'
relational_operator: '<'
threshold: 2
window: 60
periods: 0
function: min
- name: 'mysql-fs-warning'
description: "The MySQL filesystem's free space is low"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/mysql'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'mysql-fs-critical'
description: "The MySQL filesystem's free space is too low"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/mysql'
relational_operator: '<'
threshold: 2
window: 60
periods: 0
function: min
- name: 'nova-fs-warning'
description: "The filesystem's free space is low (compute node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/nova'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'nova-fs-critical'
description: "The filesystem's free space is too low (compute node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/nova'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
- name: 'nova-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on nova-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'nova-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'nova-logs-error'
description: 'Too many errors have been detected in Nova logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'nova'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'heat-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on heat-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'heat-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'heat-logs-error'
description: 'Too many errors have been detected in Heat logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'heat'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'swift-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on swift-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'swift-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'cinder-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on cinder-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'cinder-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'cinder-logs-error'
description: 'Too many errors have been detected in Cinder logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'cinder'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'glance-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on glance-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'glance-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'glance-logs-error'
description: 'Too many errors have been detected in Glance logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'glance'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'neutron-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on neutron-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'neutron-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'neutron-logs-error'
description: 'Too many errors have been detected in Neutron logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'neutron'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'keystone-public-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on keystone-public-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'keystone-public-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'keystone-admin-api-http-errors'
description: 'Too many 5xx HTTP errors have been detected on keystone-admin-api'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: haproxy_backend_response_5xx
fields:
backend: 'keystone-admin-api'
relational_operator: '>'
threshold: 0
window: 60
periods: 1
function: diff
- name: 'keystone-logs-error'
description: 'Too many errors have been detected in Keystone logs'
severity: 'warning'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: log_messages
fields:
service: 'keystone'
level: 'error'
relational_operator: '>'
threshold: 0.1
window: 70
periods: 0
function: max
- name: 'mysql-node-connected'
description: 'The MySQL service has lost connectivity with the other nodes'
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: mysql_cluster_connected
relational_operator: '=='
threshold: 0
window: 30
periods: 1
function: min
- name: 'mysql-node-ready'
description: "The MySQL service isn't ready to serve queries"
severity: 'critical'
enabled: 'true'
trigger:
logical_operator: 'or'
rules:
- metric: mysql_cluster_ready
relational_operator: '=='
threshold: 0
window: 30
periods: 1
function: min
- name: 'ceph-health-critical'
description: 'Ceph health is critical'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: ceph_health
relational_operator: '=='
threshold: 3 # HEALTH_ERR
window: 60
function: max
- name: 'ceph-health-warning'
description: 'Ceph health is warning'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: ceph_health
relational_operator: '=='
threshold: 2 # HEALTH_WARN
window: 60
function: max
- name: 'ceph-capacity-critical'
description: 'Ceph free capacity is too low'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: ceph_pool_total_percent_free
relational_operator: '<'
threshold: 2
window: 60
function: max
- name: 'ceph-capacity-warning'
description: 'Ceph free capacity is low'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: ceph_pool_total_percent_free
relational_operator: '<'
threshold: 5
window: 60
function: max
- name: 'elasticsearch-health-critical'
description: 'Elasticsearch cluster health is critical'
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_cluster_health
relational_operator: '=='
threshold: 3 # red
window: 60
function: min
- name: 'elasticsearch-health-warning'
description: 'Elasticsearch health is warning'
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: elasticsearch_cluster_health
relational_operator: '=='
threshold: 2 # yellow
window: 60
function: min
- name: 'elasticsearch-fs-warning'
description: "The filesystem's free space is low (Elasticsearch node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
relational_operator: '<'
threshold: 20
window: 60
periods: 0
function: min
- name: 'elasticsearch-fs-critical'
description: "The filesystem's free space is too low (Elasticsearch node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
relational_operator: '<'
threshold: 15
window: 60
periods: 0
function: min
- name: 'influxdb-fs-warning'
description: "The filesystem's free space is low (InfluxDB node)"
severity: 'warning'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/influxdb'
relational_operator: '<'
threshold: 10
window: 60
periods: 0
function: min
- name: 'influxdb-fs-critical'
description: "The filesystem's free space is too low (InfluxDB node)"
severity: 'critical'
enabled: 'true'
trigger:
rules:
- metric: fs_space_percent_free
fields:
fs: '/var/lib/influxdb'
relational_operator: '<'
threshold: 5
window: 60
periods: 0
function: min
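To illustrate how these definitions are interpreted, here is a minimal sketch in
Python of the evaluation of a single trigger. It is only a conceptual model of the
rule semantics (the actual evaluation is performed by the collector's Lua plugins),
and ``datapoints`` stands for a hypothetical list of metric values observed during
the rule's ``window``::

    import operator

    RELATIONAL = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
                  '<=': operator.le, '==': operator.eq}

    FUNCTIONS = {
        'avg': lambda values: sum(values) / float(len(values)),
        'min': min,
        'max': max,
        # 'diff' is modeled here as the change between the first and last values.
        'diff': lambda values: values[-1] - values[0],
    }

    def rule_matches(rule, datapoints):
        """Return True when the rule fires for the values observed in its window."""
        value = FUNCTIONS[rule['function']](datapoints)
        return RELATIONAL[rule['relational_operator']](value, rule['threshold'])

    def trigger_fires(trigger, datapoints_per_rule):
        """Combine the rule results with the trigger's logical operator."""
        results = [rule_matches(rule, points)
                   for rule, points in zip(trigger['rules'], datapoints_per_rule)]
        combine = any if trigger.get('logical_operator', 'or') == 'or' else all
        return combine(results)

    # 'cpu-critical-controller' fires, among others, when the average of
    # cpu_idle over the 120-second window is lower than or equal to 5 percent.
    rule = {'metric': 'cpu_idle', 'relational_operator': '<=', 'threshold': 5,
            'window': 120, 'periods': 0, 'function': 'avg'}
    print(rule_matches(rule, [3.2, 4.1, 2.9]))  # True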


@@ -43,7 +43,7 @@ source_suffix = '.rst'
master_doc = 'index'
# General information about the project.
project = u'The LMA Collector Plugin for Fuel'
project = u'The StackLight Collector Plugin for Fuel'
copyright = u'2015, Mirantis Inc.'
# The version info for the project you're documenting, acts as replacement for
@@ -198,7 +198,7 @@ latex_elements = {
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
('index', 'LMAcollector.tex', u'The LMA Collector Plugin for Fuel Documentation',
('index', 'LMAcollector.tex', u'The StackLight Collector Plugin for Fuel Documentation',
u'Mirantis Inc.', 'manual'),
]
@@ -228,7 +228,7 @@ latex_documents = [
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
('index', 'lmacollector', u'The LMA Collector Plugin for Fuel Documentation',
('index', 'lmacollector', u'The StackLight Collector Plugin for Fuel Documentation',
[u'Mirantis Inc.'], 1)
]
@@ -242,7 +242,7 @@ man_pages = [
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
('index', 'LMAcollector', u'The LMA Collector Plugin for Fuel Documentation',
('index', 'LMAcollector', u'The StackLight Collector Plugin for Fuel Documentation',
u'Mirantis Inc.', 'LMAcollector', 'One line description of project.',
'Miscellaneous'),
]


@@ -126,3 +126,65 @@ use the instructions below to troubleshoot the problem:
5. Check if the nodes are able to connect to the Elasticsearch server on port 9200.
6. Check if the nodes are able to connect to the InfluxDB server on port 8086.
.. _diagnostic:
Diagnostic Tool
---------------
A **global diagnostic tool** is installed on the Fuel Master node
by the StackLight Collector Plugin. The global diagnostic tool checks
that StackLight is configured and running properly across the entire
LMA toolchain for all the nodes that are ready in your OpenStack environment::
[root@nailgun ~]# /var/www/nailgun/plugins/lma_collector-<version>/contrib/tools/diagnostic.sh
Running lma_diagnostic tool on all available nodes (this can take several minutes)
The diagnostic archive is here: /var/lma_diagnostics.2016-06-10_11-23-1465557820.tgz
Note that a global diagnostic can take several minutes.
All the results are consolidated in an archive file with the
name ``/var/lma_diagnostics.[date +%Y-%m-%d_%H-%M-%s].tgz``.
Instead of running a global diagnostic, you may want to run the diagnostic
on individual nodes. The tool determines which checks should be executed
based on the role of the node, as shown below::
root@node-3:~# hiera roles
["controller"]
root@node-3:~# lma_diagnostics
2016-06-10-11-08-04 INFO node-3.test.domain.local role ["controller"]
2016-06-10-11-08-04 INFO ** LMA Collector
2016-06-10-11-08-04 INFO 2 process(es) 'hekad -config' found
2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 4352
2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 8325
2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 5567
2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 4353
[...]
In the example above, the diagnostic tool reports that two *hekad*
processes are running on *node-3*, which is the expected outcome.
In the case where one *hekad* process is not running, the
diagnostic tool would report an error as shown below::
root@node-3:~# lma_diagnostics
2016-06-10-11-11-48 INFO node-3.test.domain.local role ["controller"]
2016-06-10-11-11-48 INFO ** LMA Collector
2016-06-10-11-11-48 **ERROR 1 'hekad -config' processes found, 2 expected!**
2016-06-10-11-11-48 **ERROR 'hekad' process does not LISTEN on port: 4352**
[...]
Here, two errors are reported:
1. There is only one *hekad* process running instead of two.
2. No *hekad* process is listening on port 4352.
This is just one example of the many checks performed by the
diagnostic tool.
On the OpenStack nodes, the diagnostic results are stored
in ``/var/lma_diagnostics/diagnostics.log``.
**A successful LMA toolchain diagnostic should be free of errors**.
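To check a node's diagnostic result from a script, a minimal sketch like the
following can be used. It only assumes the ``/var/lma_diagnostics/diagnostics.log``
location mentioned above and the ``ERROR`` marker shown in the output examples::

    import sys

    LOG_FILE = '/var/lma_diagnostics/diagnostics.log'

    # Print every line reporting an error and exit with a non-zero status
    # when at least one error was found.
    with open(LOG_FILE) as log:
        errors = [line.rstrip() for line in log if 'ERROR' in line]

    for line in errors:
        print(line)
    sys.exit(1 if errors else 0)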


@@ -13,6 +13,7 @@ Welcome to the StackLight Collector Documentation!
licenses
appendix_a
appendix_b
appendix_c
Indices and Tables
==================


@@ -3,76 +3,84 @@
Overview
========
The LMA Collector is the advanced monitoring agent of the
so-called Logging, Monitoring and Alerting (LMA) Toolchain of Mirantis OpenStack,
which is now officially called the **StackLight Collector** (or just the *collector*).
The **StackLight Collector Plugin** is used to install and configure
several software components that are used to collect and process all the
data that we think is relevant to provide deep operational insights about
your OpenStack environment. These finely integrated components are
collectively referred to as the **StackLight Collector** (or just **the Collector**).
The StackLight Collector should be installed on each of the OpenStack nodes you
want to monitor. It is a key component of the
`LMA Toolchain of Mirantis OpenStack <https://launchpad.net/lma-toolchain>`_
as shown in the figure below:
.. note:: The Collector has evolved over time and so the term
'collector' is a little bit of a misnomer since it is
more of a **smart monitoring agent** than a mere data 'collector'.
The Collector is a key component of the so-called
`Logging, Monitoring and Alerting toolchain of Mirantis OpenStack
<https://launchpad.net/lma-toolchain>`_ (a.k.a StackLight).
.. image:: ../../images/toolchain_map.png
:align: center
Each *collector* is individually responsible for supporting the sensing,
measurement, collection, analysis and alarm functions for the node
it is running on.
The Collector is installed on every node of your OpenStack
environment. Each Collector is individually responsible for supporting
all the monitoring functions of your OpenStack environment for both
the operating system and the services running on the node.
Note also that the Collector running on the *primary controller*
(the controller which owns the management VIP) is called the
**Aggregator** since it performs additional aggregation and correlation
functions. The Aggregator is the central point of convergence for
all the faults and anomalies detected at the node level. The
fundamental role of the Aggregator is to issue an opinion about the
health status of your OpenStack environment at the cluster
level. As such, the Collector may be viewed as a monitoring
agent for cloud infrastructure clusters.
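The following sketch gives a simplified idea of that cluster-level reasoning. It is
not the Aggregator's actual correlation logic; the status values and the
'worst status wins' policy shown here are illustrative assumptions only::

    # Hypothetical status values, ordered from healthiest to most severe.
    SEVERITY_ORDER = ['okay', 'warning', 'critical', 'down']

    def cluster_status(member_statuses):
        """Derive a cluster-level status from the statuses of its members."""
        # A simple 'worst status wins' policy.
        return max(member_statuses, key=SEVERITY_ORDER.index)

    print(cluster_status(['okay', 'warning', 'okay']))  # 'warning'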
A wealth of operational data is collected from a variety of sources including
log files, collectd and RabbitMQ for the OpenStack notifications.
The main building blocks of the Collector are:
.. note:: The *collector* which runs on the active controller of the control
plane cluster is called the *aggregator* because it performs additional
aggregation and multivariate correlation functions to compute service
health metrics at the cluster level.
* **collectd** which comes bundled with a collection of monitoring plugins.
Some of them are standard collectd plugins while others are purpose-built
plugins written in Python to perform various OpenStack service checks.
* **Heka**, `a golang data processing swiss army knife by Mozilla
<https://github.com/mozilla-services/heka>`_.
Heka supports a number of standard input and output plugins
that allow it to ingest data from a variety of sources,
including collectd, log files and RabbitMQ,
as well as to persist the operational data to external backend servers like
Elasticsearch, InfluxDB and Nagios for search and further processing.
* **A collection of Heka plugins** written in Lua which does
the actual data processing, such as metrics transformations,
alarm evaluation and log parsing.
A primary function of the *collector* is to sanitise and transform the ingested
raw operational data into internal message representations using the
`Heka message structure <https://hekad.readthedocs.io/en/stable/message/index.html>`_.
This message structure is used within the *collector's* plugin framework to match,
filter and route messages to plugins written in `Lua <http://www.lua.org/>`_
that perform various data analysis and computation functions.
.. note:: An important function of the Collector is to normalize
the operational data into an internal `Heka message structure
<https://hekad.readthedocs.io/en/stable/message/index.html>`_
representation that can be ingested into Heka's stream processing
pipeline. The stream processing pipeline uses matching policies to
route the Heka messages to the `Lua <http://www.lua.org/>`_ plugins that
will perform the actual data computation functions.
As such, the *collector* may also be described as a pluggable framework
for operational data stream processing and routing.
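For illustration, the snippet below shows, as a Python dictionary, roughly what such
a normalized message could look like for a log record. The top-level attributes come
from the Heka message structure; the concrete ``Type`` and ``Fields`` values are
simplified assumptions, not the exact ones produced by the collector's plugins::

    # A simplified, hypothetical representation of a normalized log message.
    normalized_message = {
        'Timestamp': 1467806532000000000,        # nanoseconds since the Unix epoch
        'Logger': 'openstack.nova',              # the source of the data
        'Type': 'log',                           # used to match and route the message
        'Severity': 3,                           # syslog-like severity (3 == error)
        'Hostname': 'node-3.test.domain.local',
        'Payload': 'Unexpected error while running command.',
        'Fields': {                              # free-form metadata added by the decoders
            'programname': 'nova-api',
            'severity_label': 'ERROR',
        },
    }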
There are three types of Lua plugins that were developed for the Collector:
Its main building blocks are:
* The **decoder plugins** to sanitize and normalize the ingested data.
* The **filter plugins** to process the data.
* The **encoder plugins** to serialize the data that is
sent to the backend servers.
* `collectd <https://collectd.org/>`_ which is bundled with a collection of
monitoring plugins. Many of them are purpose-built for OpenStack.
* `Heka <https://github.com/mozilla-services/heka>`_ (a golang data processing
*swiss army knife* by Mozilla) which is the cornerstone technology of the Collector.
Heka supports out of the box a number of input and output plugins that allow
the Collector to integrate with a number of external systems' native
protocols like Elasticsearch, InfluxDB, Nagios, SMTP, Whisper, Kafka, AMQP and
Carbon, to name a few.
* A collection of Heka plugins written in Lua to decode, process and encode the
operational data.
There are five types of data sent by the Collector (and the Aggregator)
to the backend servers:
There are three types of Lua plugins running in the *collector*:
* The logs and the notifications, which are referred to as events,
sent to Elasticsearch for indexing.
* The metrics' time series sent to InfluxDB.
* The annotations sent to InfluxDB.
* The OpenStack environment clusters' health status
sent as *passive checks* to Nagios.
* The input plugins which collect, sanitize and transform the raw
data into an internal message representation which is injected into the
Heka pipeline for further processing.
* The filter plugins which execute the analysis and correlation functions.
* The output plugins which encode and transmit the messages to external
systems like Elasticsearch, InfluxDB or Nagios where the data can
be further processed and persisted.
The output of the *collector* and *aggregator* is of four kinds:
* The logs and notifications which are sent to Elasticsearch for indexing.
Elasticsearch combined with Kibana provides insightful log analytics.
* The metrics which are sent to InfluxDB.
InfluxDB combined with Grafana provides insightful time-series analytics.
* The health status metrics for the OpenStack clusters which are sent to Nagios
(or via SMTP) for alerting and escalation purposes.
* The annotation messages which are sent to InfluxDB. The annotation messages contain
information about what caused a service cluster or node cluster to change state
and provide root-cause analysis hints whenever possible. They are also used to
construct the alert notifications that are sent via SMTP or to Nagios.
.. note:: The annotations are like notification messages
which are exposed in Grafana. They contain information about the
anomalies and faults that have been detected by the Collector.
They basically contain the same information as the *passive checks*
sent to Nagios. In addition, they may contain 'hints' about what
the Collector thinks could be the root cause of a problem.
.. _plugin_requirements:
@@ -94,16 +102,9 @@ Requirements
Limitations
-----------
* The plugin is not compatible with an OpenStack environment deployed with Nova-Network.
* The Elasticsearch output plugin of the *collector* is configured to use the **drop** policy
which implies that the *collector* will start dropping the logs and the OpenStack
notifications when the output plugin has reached a buffering limit that is currently
set to 1GB by default. This situation can typically happen when the Elasticsearch server
has been inaccessible for a long period of time.
This limitation may be addressed in a future release of the StackLight Collector Plugin.
* The plugin is not compatible with an OpenStack environment deployed with nova-network.
* When you re-execute tasks on deployed nodes using the Fuel CLI, the *hekad* and
*collectd* services will be restarted on these nodes during the post-deployment
*collectd* processes will be restarted on these nodes during the post-deployment
phase. See `bug #1570850
<https://bugs.launchpad.net/lma-toolchain/+bug/1570850>`_ for details.


@@ -12,27 +12,27 @@ Version 0.10.0
Prior to StackLight version 0.10.0, there was one instance of the *hekad*
process running to process both the logs and the metrics. Starting with StackLight
version 0.10.0, the processing of logs and notifications is separated
from the processing of metrics into two different *hekad* instances.
This allows for better performance and flow control mechanisms when the
version 0.10.0, the processing of the logs and notifications is separated
from the processing of the metrics in two different *hekad* instances.
This allows for better performance and control of the flow when the
maximum buffer size on disk has reached a limit. With the *hekad* instance
processing the metrics, the buffering policy is to drop the metrics
when the maximum buffer size is reached. With the *hekad* instance
processing the logs, the buffering policy is to block the
entire processing pipeline. This way, one can avoid
losing logs (and notifications) in cases when the Elasticsearch
server has been inaccessible for a long period of time.
As a result, the StackLight collector now has two services running
on a node:
entire processing pipeline. This way, we can avoid
losing logs (and notifications) when the Elasticsearch
server is inaccessible for a long period of time.
As a result, the StackLight collector now has two processes running
on the node:
* The **log_collector** service
* The **metric_collector** service
* One for the *log_collector* service
* One for the *metric_collector* service
* Metrics derived from logs are now aggregated
* Metrics derived from logs are aggregated by the *log_collector* service.
To avoid flooding the *metric_collector* with bursts of metrics derived
from logs, the *log_collector* service sends aggregated metrics
by bulk to the *metric_collector* service.
from logs, the *log_collector* service sends metrics by bulk to the
*metric_collector* service.
An example of an aggregated metric derived from logs is the
`openstack_<service>_http_response_time_stats
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/appendix_b.html#api-response-times>`_.
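The sketch below illustrates, in simplified Python rather than the actual Lua filter,
the kind of aggregation performed by the *log_collector*: individual response times
extracted from the logs are accumulated for an interval and only the computed
statistics are flushed in bulk. The statistics fields shown are illustrative
assumptions::

    from collections import defaultdict

    class HttpResponseTimeAggregator(object):
        """Accumulate per-service response times and flush them as one bulk update."""

        def __init__(self):
            self.samples = defaultdict(list)

        def add(self, service, response_time):
            # Called for every parsed HTTP log record.
            self.samples[service].append(response_time)

        def flush(self):
            # Called periodically: emit one aggregated metric per service
            # instead of one metric per log line.
            bulk = []
            for service, values in sorted(self.samples.items()):
                bulk.append({
                    'name': 'openstack_%s_http_response_time_stats' % service,
                    'min': min(values),
                    'max': max(values),
                    'count': len(values),
                })
            self.samples.clear()
            return bulk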
@@ -41,10 +41,10 @@ Version 0.10.0
A diagnostic tool is now available to help diagnose problems.
The diagnostic tool checks that the toolchain is properly installed
and configured across the entire StackLight LMA toolchain. Please check the
the `Troubleshooting Chapter
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#troubleshooting>`_
of the User Guide for more information.
and configured across the entire LMA toolchain. Please check the
`Diagnostic Tool
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#diagnostic>`_
section of the User Guide for more information.
* Bug fixes