Separate the log and metric pipelines of the LMA collector

This change separates the processing of logs/notifications and of
metrics/alerting into 2 dedicated hekad processes; these services are
named 'log_collector' and 'metric_collector'.

Both services are managed by Pacemaker on controller nodes and by Upstart on
other nodes.

All metrics computed by log_collector (HTTP response times and creation time
for instances and volumes) are sent directly to the metric_collector via TCP.
The Elasticsearch output of the log_collector uses full_action='block'
while its TCP output uses full_action='drop'.
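As a rough sketch, this is what the log_collector outputs could look like in
the hekad TOML configuration; the section names, address, matchers and encoder
below are invented for illustration, and only the use_buffering/full_action
settings reflect the behaviour described above:

    [elasticsearch_output]
    type = "ElasticSearchOutput"
    server = "http://127.0.0.1:9200"       # assumed Elasticsearch endpoint
    message_matcher = "Type == 'log'"      # assumed matcher
    encoder = "elasticsearch_encoder"      # assumed encoder name
    use_buffering = true

    [elasticsearch_output.buffering]
    full_action = "block"                  # block the pipeline when the buffer is full

    [tcp_output]
    type = "TcpOutput"
    address = "127.0.0.1:5567"             # assumed metric_collector TCP input
    message_matcher = "Type == 'metric'"   # assumed matcher
    use_buffering = true

    [tcp_output.buffering]
    full_action = "drop"                   # drop new messages when the buffer is full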

All outputs of metric_collector (InfluxDB, HTTP and TCP) use
full_action='drop'.

The buffer size configurations are as follows (see the sketch after this list):
* metric_collector:
  - influxdb-output buffer size is increased to 1GB.
  - aggregator-output (tcp) buffer size is decreased to 256MB (vs 1GB).
  - nagios outputs (x3) buffer sizes are decreased to 1MB.
* log_collector:
  - elasticsearch-output buffer size is decreased to 256MB (vs 1GB).
  - tcp-output buffer size is set to 256MB.
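For illustration only, the metric_collector buffering could be expressed along
these lines in the hekad TOML configuration; the section names are invented and
the sizes are simply the byte equivalents of the values listed above:

    [influxdb_output.buffering]
    max_buffer_size = 1073741824   # 1GB
    full_action = "drop"

    [aggregator_output.buffering]
    max_buffer_size = 268435456    # 256MB
    full_action = "drop"

    [nagios_output.buffering]
    max_buffer_size = 1048576      # 1MB
    full_action = "drop"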

Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748

Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
Author: Swann Croiset
Date:   2016-03-27 22:46:52 +02:00
Parent: 3808d9f233
Commit: ebac150f8a

52 changed files with 838 additions and 446 deletions

@@ -71,10 +71,11 @@ Plugin verification
 -------------------
 Once the OpenStack environment is ready, you may want to check that both
-the 'collectd' and 'hekad' processes of the LMA Collector are running on the OpenStack nodes::
+the 'collectd' and 'hekad' processes are running on the OpenStack nodes::

     [root@node-1 ~]# pidof hekad
     5568
+    5569
     [root@node-1 ~]# pidof collectd
     5684
@@ -85,23 +86,28 @@ Troubleshooting
 If you see no data in the Kibana and/or Grafana dashboards, use the instructions below to troubleshoot the problem:

-1. Check if the LMA Collector service is up and running::
+1. Check if LMA Collector services are up and running::

     # On the controller node(s)
-    [root@node-1 ~]# crm resource status lma_collector
+    [root@node-1 ~]# crm resource status metric_collector
+    [root@node-1 ~]# crm resource status log_collector

     # On non controller nodes
-    [root@node-1 ~]# status lma_collector
+    [root@node-2 ~]# status log_collector
+    [root@node-2 ~]# status metric_collector

-2. If the LMA Collector is down, restart it::
+2. If one of the LMA Collectors is down, restart it::

     # On the controller node(s)
-    [root@node-1 ~]# crm resource start lma_collector
+    [root@node-1 ~]# crm resource start log_collector
+    [root@node-1 ~]# crm resource start metric_collector

     # On non controller nodes
-    [root@node-1 ~]# start lma_collector
+    [root@node-2 ~]# start log_collector
+    [root@node-2 ~]# start metric_collector

-3. Look for errors in the LMA Collector log file (located at /var/log/lma_collector.log) on the different nodes.
+3. Look for errors in the LMA Collector log file (located at /var/log/log_collector.log and /var/log/metric_collector.log)
+   on the different nodes.

 4. Look for errors in the collectd log file (located at /var/log/collectd.log) on the different nodes.