Separate the log and metric pipelines of the LMA collector

This change separates the processing of logs/notifications and of
metrics/alerting into 2 dedicated hekad processes; these services are
named 'log_collector' and 'metric_collector'.

Both services are managed by Pacemaker on controller nodes and by Upstart on
other nodes.

All metrics computed by log_collector (HTTP response times and creation time
for instances and volumes) are sent directly to the metric_collector via TCP.
The Elasticsearch output of the log_collector uses full_action='block'
while its TCP output uses full_action='drop'.
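As a rough sketch, this is what the log_collector outputs could look like in
the hekad TOML configuration; the section names, address, matchers and encoder
below are invented for illustration, and only the use_buffering/full_action
settings reflect the behaviour described above:

    [elasticsearch_output]
    type = "ElasticSearchOutput"
    server = "http://127.0.0.1:9200"       # assumed Elasticsearch endpoint
    message_matcher = "Type == 'log'"      # assumed matcher
    encoder = "elasticsearch_encoder"      # assumed encoder name
    use_buffering = true

    [elasticsearch_output.buffering]
    full_action = "block"                  # block the pipeline when the buffer is full

    [tcp_output]
    type = "TcpOutput"
    address = "127.0.0.1:5567"             # assumed metric_collector TCP input
    message_matcher = "Type == 'metric'"   # assumed matcher
    use_buffering = true

    [tcp_output.buffering]
    full_action = "drop"                   # drop new messages when the buffer is full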

All outputs of metric_collector (InfluxDB, HTTP and TCP) use
full_action='drop'.

The buffer size configurations are as follows (see the sketch after this list):
* metric_collector:
  - influxdb-output buffer size is increased to 1GB.
  - aggregator-output (tcp) buffer size is decreased to 256MB (vs 1GB).
  - nagios outputs (x3) buffer sizes are decreased to 1MB.
* log_collector:
  - elasticsearch-output buffer size is decreased to 256MB (vs 1GB).
  - tcp-output buffer size is set to 256MB.
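For illustration only, the metric_collector buffering could be expressed along
these lines in the hekad TOML configuration; the section names are invented and
the sizes are simply the byte equivalents of the values listed above:

    [influxdb_output.buffering]
    max_buffer_size = 1073741824   # 1GB
    full_action = "drop"

    [aggregator_output.buffering]
    max_buffer_size = 268435456    # 256MB
    full_action = "drop"

    [nagios_output.buffering]
    max_buffer_size = 1048576      # 1MB
    full_action = "drop"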

Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748

Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
Author: Swann Croiset
Date:   2016-03-27 22:46:52 +02:00
Parent: 3808d9f233
Commit: ebac150f8a

52 changed files with 838 additions and 446 deletions

@@ -71,10 +71,11 @@ Plugin verification
 -------------------
 Once the OpenStack environment is ready, you may want to check that both
-the 'collectd' and 'hekad' processes of the LMA Collector are running on the OpenStack nodes::
+the 'collectd' and 'hekad' processes are running on the OpenStack nodes::

     [root@node-1 ~]# pidof hekad
     5568
+    5569
     [root@node-1 ~]# pidof collectd
     5684
@@ -85,23 +86,28 @@ Troubleshooting
 If you see no data in the Kibana and/or Grafana dashboards, use the instructions below to troubleshoot the problem:

-1. Check if the LMA Collector service is up and running::
+1. Check if LMA Collector services are up and running::

     # On the controller node(s)
-    [root@node-1 ~]# crm resource status lma_collector
+    [root@node-1 ~]# crm resource status metric_collector
+    [root@node-1 ~]# crm resource status log_collector

     # On non controller nodes
-    [root@node-1 ~]# status lma_collector
+    [root@node-2 ~]# status log_collector
+    [root@node-2 ~]# status metric_collector

-2. If the LMA Collector is down, restart it::
+2. If one of the LMA Collectors is down, restart it::

     # On the controller node(s)
-    [root@node-1 ~]# crm resource start lma_collector
+    [root@node-1 ~]# crm resource start log_collector
+    [root@node-1 ~]# crm resource start metric_collector

     # On non controller nodes
-    [root@node-1 ~]# start lma_collector
+    [root@node-2 ~]# start log_collector
+    [root@node-2 ~]# start metric_collector

-3. Look for errors in the LMA Collector log file (located at /var/log/lma_collector.log) on the different nodes.
+3. Look for errors in the LMA Collector log file (located at /var/log/log_collector.log and /var/log/metric_collector.log)
+   on the different nodes.

 4. Look for errors in the collectd log file (located at /var/log/collectd.log) on the different nodes.