Separate the log and metric pipelines of the LMA collector
This change separates the processing of logs/notifications and metrics/alerting
into two dedicated hekad processes, named 'log_collector' and 'metric_collector'.
Both services are managed by Pacemaker on the controller nodes and by Upstart on
the other nodes. All metrics computed by the log_collector (HTTP response times
and creation times for instances and volumes) are sent directly to the
metric_collector via TCP.

The Elasticsearch output of the log_collector uses full_action='block' and its
TCP output uses full_action='drop'. All outputs of the metric_collector
(InfluxDB, HTTP and TCP) use full_action='drop'.

The buffer size configurations are:

* metric_collector:
  - influxdb-output buffer size is increased to 1Gb.
  - aggregator-output (TCP) buffer size is decreased to 256Mb (vs 1Gb).
  - nagios outputs (x3) buffer size is decreased to 1Mb.

* log_collector:
  - elasticsearch-output buffer size is decreased to 256Mb (vs 1Gb).
  - tcp-output buffer size is set to 256Mb.

Implements: blueprint separate-lma-collector-pipelines
Fixes-bug: #1566748
Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
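Since the log_collector now ships the metrics it derives (HTTP response times,
instance and volume creation times) to the metric_collector over a hekad TCP
output/input pair, a quick way to sanity-check the split on a deployed node is
to look for the TCP connection between the two hekad processes. This is only a
sketch: the port number and whether the link stays on the local node depend on
the generated configuration, which is not part of this change::

    # Two hekad processes should be running (log_collector and metric_collector);
    # the TCP connection between them carries the metrics computed by the
    # log_collector. The port number is deployment-specific.
    [root@node-1 ~]# pidof hekad
    [root@node-1 ~]# netstat -tnp | grep hekad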
@@ -71,10 +71,11 @@ Plugin verification
 -------------------
 
 Once the OpenStack environment is ready, you may want to check that both
-the 'collectd' and 'hekad' processes of the LMA Collector are running on the OpenStack nodes::
+the 'collectd' and 'hekad' processes are running on the OpenStack nodes::
 
     [root@node-1 ~]# pidof hekad
     5568
+    5569
     [root@node-1 ~]# pidof collectd
     5684
 
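With the collector split in two, 'pidof hekad' returns two PIDs per node and it
is not obvious which one belongs to which service. On the nodes where the
collectors are managed by Upstart, the PIDs can be mapped back to their services
as sketched below; which of the two PIDs belongs to which service is illustrative
here, not taken from this change::

    # Upstart reports the main PID of each job.
    [root@node-2 ~]# status log_collector
    log_collector start/running, process 5568
    [root@node-2 ~]# status metric_collector
    metric_collector start/running, process 5569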
@@ -85,23 +86,28 @@ Troubleshooting
 
 If you see no data in the Kibana and/or Grafana dashboards, use the instructions below to troubleshoot the problem:
 
-1. Check if the LMA Collector service is up and running::
+1. Check if LMA Collector services are up and running::
 
    # On the controller node(s)
-   [root@node-1 ~]# crm resource status lma_collector
+   [root@node-1 ~]# crm resource status metric_collector
+   [root@node-1 ~]# crm resource status log_collector
 
    # On non controller nodes
-   [root@node-1 ~]# status lma_collector
+   [root@node-2 ~]# status log_collector
+   [root@node-2 ~]# status metric_collector
 
-2. If the LMA Collector is down, restart it::
+2. If one of the LMA Collectors is down, restart it::
 
    # On the controller node(s)
-   [root@node-1 ~]# crm resource start lma_collector
+   [root@node-1 ~]# crm resource start log_collector
+   [root@node-1 ~]# crm resource start metric_collector
 
    # On non controller nodes
-   [root@node-1 ~]# start lma_collector
+   [root@node-2 ~]# start log_collector
+   [root@node-2 ~]# start metric_collector
 
-3. Look for errors in the LMA Collector log file (located at /var/log/lma_collector.log) on the different nodes.
+3. Look for errors in the LMA Collector log file (located at /var/log/log_collector.log and /var/log/metric_collector.log)
+   on the different nodes.
 
 4. Look for errors in the collectd log file (located at /var/log/collectd.log) on the different nodes.
 
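For step 3 above, both collector log files can be scanned in one pass. This is
a minimal sketch; the keywords matched are the log levels hekad commonly emits,
not strings defined by this change::

    # Show the 20 most recent warning/error lines from both collectors.
    [root@node-1 ~]# grep -iE 'error|warn' /var/log/log_collector.log /var/log/metric_collector.log | tail -n 20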