StackLight 0.10.0 documentation updates
Change-Id: I13d1de2c2984c09e79c68301a46d35479f9afb21
@@ -0,0 +1,736 @@
.. _alarm_list:

Appendix C: List of built-in alarms
===================================

Here is the list of all the alarms that are built into StackLight::

  alarms:
    - name: 'cpu-critical-controller'
      description: 'The CPU usage is too high (controller node)'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 5
            window: 120
            periods: 0
            function: avg
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 35
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-warning-controller'
      description: 'The CPU usage is high (controller node)'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 15
            window: 120
            periods: 0
            function: avg
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 25
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-critical-compute'
      description: 'The CPU usage is too high (compute node)'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 30
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-warning-compute'
      description: 'The CPU usage is high (compute node)'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 20
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-critical-rabbitmq'
      description: 'The CPU usage is too high (RabbitMQ node)'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 5
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-warning-rabbitmq'
      description: 'The CPU usage is high (RabbitMQ node)'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 15
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-critical-mysql'
      description: 'The CPU usage is too high (MySQL node)'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 5
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-warning-mysql'
      description: 'The CPU usage is high (MySQL node)'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 15
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-critical-storage'
      description: 'The CPU usage is too high (storage node)'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 40
            window: 120
            periods: 0
            function: avg
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 5
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-warning-storage'
      description: 'The CPU usage is high (storage node)'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 30
            window: 120
            periods: 0
            function: avg
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 15
            window: 120
            periods: 0
            function: avg
    - name: 'cpu-critical-default'
      description: 'The CPU usage is too high'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: cpu_wait
            relational_operator: '>='
            threshold: 35
            window: 120
            periods: 0
            function: avg
          - metric: cpu_idle
            relational_operator: '<='
            threshold: 5
            window: 120
            periods: 0
            function: avg
    - name: 'rabbitmq-disk-limit-critical'
      description: 'RabbitMQ has reached the free disk threshold. All producers are blocked'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: rabbitmq_remaining_disk
            relational_operator: '<='
            threshold: 0
            window: 20
            periods: 0
            function: min
    - name: 'rabbitmq-disk-limit-warning'
      description: 'RabbitMQ is getting close to the free disk threshold'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: rabbitmq_remaining_disk
            relational_operator: '<='
            threshold: 104857600 # 100MB
            window: 20
            periods: 0
            function: min
    - name: 'rabbitmq-memory-limit-critical'
      description: 'RabbitMQ has reached the memory threshold. All producers are blocked'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: rabbitmq_remaining_memory
            relational_operator: '<='
            threshold: 0
            window: 20
            periods: 0
            function: min
    - name: 'rabbitmq-memory-limit-warning'
      description: 'RabbitMQ is getting close to the memory threshold'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: rabbitmq_remaining_memory
            relational_operator: '<='
            threshold: 104857600 # 100MB
            window: 20
            periods: 0
            function: min
    - name: 'rabbitmq-queue-warning'
      description: 'The number of outstanding messages is too high'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: rabbitmq_messages
            relational_operator: '>='
            threshold: 200
            window: 120
            periods: 0
            function: avg
    - name: 'apache-warning'
      description: 'There is no Apache idle workers available'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: apache_idle_workers
            relational_operator: '=='
            threshold: 0
            window: 60
            periods: 0
            function: min
    - name: 'log-fs-warning'
      description: "The log filesystem's free space is low"
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/log'
            relational_operator: '<'
            threshold: 10
            window: 60
            periods: 0
            function: min
    - name: 'log-fs-critical'
      description: "The log filesystem's free space is too low"
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/log'
            relational_operator: '<'
            threshold: 5
            window: 60
            periods: 0
            function: min
    - name: 'root-fs-warning'
      description: "The root filesystem's free space is low"
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/'
            relational_operator: '<'
            threshold: 5
            window: 60
            periods: 0
            function: min
    - name: 'root-fs-critical'
      description: "The root filesystem's free space is too low"
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/'
            relational_operator: '<'
            threshold: 2
            window: 60
            periods: 0
            function: min
    - name: 'mysql-fs-warning'
      description: "The MySQL filesystem's free space is low"
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/lib/mysql'
            relational_operator: '<'
            threshold: 5
            window: 60
            periods: 0
            function: min
    - name: 'mysql-fs-critical'
      description: "The MySQL filesystem's free space is too low"
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/lib/mysql'
            relational_operator: '<'
            threshold: 2
            window: 60
            periods: 0
            function: min
    - name: 'nova-fs-warning'
      description: "The filesystem's free space is low (compute node)"
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/lib/nova'
            relational_operator: '<'
            threshold: 10
            window: 60
            periods: 0
            function: min
    - name: 'nova-fs-critical'
      description: "The filesystem's free space is too low (compute node)"
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/lib/nova'
            relational_operator: '<'
            threshold: 5
            window: 60
            periods: 0
            function: min
    - name: 'nova-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on nova-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'nova-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'nova-logs-error'
      description: 'Too many errors have been detected in Nova logs'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: log_messages
            fields:
              service: 'nova'
              level: 'error'
            relational_operator: '>'
            threshold: 0.1
            window: 70
            periods: 0
            function: max
    - name: 'heat-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on heat-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'heat-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'heat-logs-error'
      description: 'Too many errors have been detected in Heat logs'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: log_messages
            fields:
              service: 'heat'
              level: 'error'
            relational_operator: '>'
            threshold: 0.1
            window: 70
            periods: 0
            function: max
    - name: 'swift-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on swift-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'swift-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'cinder-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on cinder-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'cinder-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'cinder-logs-error'
      description: 'Too many errors have been detected in Cinder logs'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: log_messages
            fields:
              service: 'cinder'
              level: 'error'
            relational_operator: '>'
            threshold: 0.1
            window: 70
            periods: 0
            function: max
    - name: 'glance-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on glance-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'glance-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'glance-logs-error'
      description: 'Too many errors have been detected in Glance logs'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: log_messages
            fields:
              service: 'glance'
              level: 'error'
            relational_operator: '>'
            threshold: 0.1
            window: 70
            periods: 0
            function: max
    - name: 'neutron-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on neutron-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'neutron-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'neutron-logs-error'
      description: 'Too many errors have been detected in Neutron logs'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: log_messages
            fields:
              service: 'neutron'
              level: 'error'
            relational_operator: '>'
            threshold: 0.1
            window: 70
            periods: 0
            function: max
    - name: 'keystone-public-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on keystone-public-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'keystone-public-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'keystone-admin-api-http-errors'
      description: 'Too many 5xx HTTP errors have been detected on keystone-admin-api'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: haproxy_backend_response_5xx
            fields:
              backend: 'keystone-admin-api'
            relational_operator: '>'
            threshold: 0
            window: 60
            periods: 1
            function: diff
    - name: 'keystone-logs-error'
      description: 'Too many errors have been detected in Keystone logs'
      severity: 'warning'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: log_messages
            fields:
              service: 'keystone'
              level: 'error'
            relational_operator: '>'
            threshold: 0.1
            window: 70
            periods: 0
            function: max
    - name: 'mysql-node-connected'
      description: 'The MySQL service has lost connectivity with the other nodes'
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: mysql_cluster_connected
            relational_operator: '=='
            threshold: 0
            window: 30
            periods: 1
            function: min
    - name: 'mysql-node-ready'
      description: "The MySQL service isn't ready to serve queries"
      severity: 'critical'
      enabled: 'true'
      trigger:
        logical_operator: 'or'
        rules:
          - metric: mysql_cluster_ready
            relational_operator: '=='
            threshold: 0
            window: 30
            periods: 1
            function: min
    - name: 'ceph-health-critical'
      description: 'Ceph health is critical'
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: ceph_health
            relational_operator: '=='
            threshold: 3 # HEALTH_ERR
            window: 60
            function: max
    - name: 'ceph-health-warning'
      description: 'Ceph health is warning'
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: ceph_health
            relational_operator: '=='
            threshold: 2 # HEALTH_WARN
            window: 60
            function: max
    - name: 'ceph-capacity-critical'
      description: 'Ceph free capacity is too low'
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: ceph_pool_total_percent_free
            relational_operator: '<'
            threshold: 2
            window: 60
            function: max
    - name: 'ceph-capacity-warning'
      description: 'Ceph free capacity is low'
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: ceph_pool_total_percent_free
            relational_operator: '<'
            threshold: 5
            window: 60
            function: max
    - name: 'elasticsearch-health-critical'
      description: 'Elasticsearch cluster health is critical'
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: elasticsearch_cluster_health
            relational_operator: '=='
            threshold: 3 # red
            window: 60
            function: min
    - name: 'elasticsearch-health-warning'
      description: 'Elasticsearch health is warning'
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: elasticsearch_cluster_health
            relational_operator: '=='
            threshold: 2 # yellow
            window: 60
            function: min
    - name: 'elasticsearch-fs-warning'
      description: "The filesystem's free space is low (Elasticsearch node)"
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
            relational_operator: '<'
            threshold: 20
            window: 60
            periods: 0
            function: min
    - name: 'elasticsearch-fs-critical'
      description: "The filesystem's free space is too low (Elasticsearch node)"
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/opt/es/data' # Real FS is /opt/es-data but Collectd substituted '/' by '-'
            relational_operator: '<'
            threshold: 15
            window: 60
            periods: 0
            function: min
    - name: 'influxdb-fs-warning'
      description: "The filesystem's free space is low (InfluxDB node)"
      severity: 'warning'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/lib/influxdb'
            relational_operator: '<'
            threshold: 10
            window: 60
            periods: 0
            function: min
    - name: 'influxdb-fs-critical'
      description: "The filesystem's free space is too low (InfluxDB node)"
      severity: 'critical'
      enabled: 'true'
      trigger:
        rules:
          - metric: fs_space_percent_free
            fields:
              fs: '/var/lib/influxdb'
            relational_operator: '<'
            threshold: 5
            window: 60
            periods: 0
            function: min

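Reading the list: each rule applies an aggregation ``function`` (``avg``, ``min``,
``max`` or ``diff``) to a metric over a sliding ``window`` of seconds and compares
the result against ``threshold`` using ``relational_operator``; the trigger's
``logical_operator`` combines the rules. The following is a minimal Python sketch of
that evaluation logic, for illustration only (it assumes simple in-memory datapoints;
the real evaluation is performed by the Collector's Lua plugins)::

  import operator
  import statistics

  # Hypothetical helper tables; these names are not part of StackLight.
  FUNCTIONS = {
      'avg': statistics.mean,
      'min': min,
      'max': max,
      'diff': lambda points: points[-1] - points[0],
  }
  OPERATORS = {
      '<': operator.lt, '<=': operator.le,
      '>': operator.gt, '>=': operator.ge, '==': operator.eq,
  }

  def rule_matches(rule, points):
      """Aggregate the datapoints seen in the window and compare."""
      value = FUNCTIONS[rule['function']](points)
      return OPERATORS[rule['relational_operator']](value, rule['threshold'])

  def trigger_fires(trigger, points_by_metric):
      """Combine the rule results with the trigger's logical operator."""
      combine = any if trigger.get('logical_operator', 'or') == 'or' else all
      return combine(rule_matches(rule, points_by_metric[rule['metric']])
                     for rule in trigger['rules'])

  # cpu-critical-controller fires if avg(cpu_idle) <= 5 or avg(cpu_wait) >= 35:
  trigger = {'logical_operator': 'or', 'rules': [
      {'metric': 'cpu_idle', 'relational_operator': '<=',
       'threshold': 5, 'function': 'avg'},
      {'metric': 'cpu_wait', 'relational_operator': '>=',
       'threshold': 35, 'function': 'avg'},
  ]}
  print(trigger_fires(trigger, {'cpu_idle': [3, 4], 'cpu_wait': [10, 12]}))  # True
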
@@ -43,7 +43,7 @@ source_suffix = '.rst'

master_doc = 'index'

# General information about the project.
project = u'The StackLight Collector Plugin for Fuel'
copyright = u'2015, Mirantis Inc.'

# The version info for the project you're documenting, acts as replacement for

@@ -198,7 +198,7 @@ latex_elements = {

# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
  ('index', 'LMAcollector.tex', u'The StackLight Collector Plugin for Fuel Documentation',
   u'Mirantis Inc.', 'manual'),
]

@@ -228,7 +228,7 @@ latex_documents = [

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'lmacollector', u'The StackLight Collector Plugin for Fuel Documentation',
     [u'Mirantis Inc.'], 1)
]

@@ -242,7 +242,7 @@ man_pages = [

# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
  ('index', 'LMAcollector', u'The StackLight Collector Plugin for Fuel Documentation',
   u'Mirantis Inc.', 'LMAcollector', 'One line description of project.',
   'Miscellaneous'),
]

@@ -126,3 +126,65 @@ use the instructions below to troubleshoot the problem:

5. Check if the nodes are able to connect to the Elasticsearch server on port 9200.

6. Check if the nodes are able to connect to the InfluxDB server on port 8086.

.. _diagnostic:

Diagnostic Tool
---------------

A **global diagnostic tool** is installed on the Fuel Master node
by the StackLight Collector Plugin. The global diagnostic tool checks
that StackLight is configured and running properly across the entire
LMA toolchain for all the nodes that are ready in your OpenStack environment::

  [root@nailgun ~]# /var/www/nailgun/plugins/lma_collector-<version>/contrib/tools/diagnostic.sh
  Running lma_diagnostic tool on all available nodes (this can take several minutes)
  The diagnostic archive is here: /var/lma_diagnostics.2016-06-10_11-23-1465557820.tgz

Note that a global diagnostic can take several minutes.

All the results are consolidated in an archive file with the
name ``/var/lma_diagnostics.[date +%Y-%m-%d_%H-%M-%s].tgz``.

Instead of running a global diagnostic, you may want to run the diagnostic
on individual nodes. The tool will figure out what checks should be executed
based on the role of the node, as shown below::

  root@node-3:~# hiera roles
  ["controller"]

  root@node-3:~# lma_diagnostics

  2016-06-10-11-08-04 INFO node-3.test.domain.local role ["controller"]
  2016-06-10-11-08-04 INFO ** LMA Collector
  2016-06-10-11-08-04 INFO 2 process(es) 'hekad -config' found
  2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 4352
  2016-06-10-11-08-04 INFO 1 process(es) hekad is/are listening on port 8325
  2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 5567
  2016-06-10-11-08-05 INFO 1 process(es) hekad is/are listening on port 4353
  [...]

In the example above, the diagnostic tool reports that two *hekad*
processes are running on *node-3*, which is the expected outcome.
In the case where one *hekad* process is not running, the
diagnostic tool would report an error as shown below::

  root@node-3:~# lma_diagnostics
  2016-06-10-11-11-48 INFO node-3.test.domain.local role ["controller"]
  2016-06-10-11-11-48 INFO ** LMA Collector
  2016-06-10-11-11-48 **ERROR 1 'hekad -config' processes found, 2 expected!**
  2016-06-10-11-11-48 **ERROR 'hekad' process does not LISTEN on port: 4352**
  [...]

Here, two errors are reported:

1. There is only one *hekad* process running instead of two.
2. No *hekad* process is listening on port 4352.

This is one example of the type of checks performed by the
diagnostic tool, but there are many others.
On the OpenStack nodes, the diagnostic's results are stored
in ``/var/lma_diagnostics/diagnostics.log``.

**A successful LMA toolchain diagnostic should be free of errors**.

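If needed, the two checks shown in the output above (process count and listening
ports) can be approximated by hand. The following is a hedged Python sketch, not the
diagnostic tool's actual implementation; the helper names are made up for the
example::

  import socket
  import subprocess

  def hekad_process_count():
      """Count running 'hekad -config' processes, as the tool reports."""
      result = subprocess.run(['pgrep', '-f', 'hekad -config'],
                              capture_output=True, text=True)
      return len(result.stdout.split())

  def port_is_listening(port, host='127.0.0.1'):
      """Return True if something accepts TCP connections on the port."""
      try:
          with socket.create_connection((host, port), timeout=1):
              return True
      except OSError:
          return False

  print('hekad processes:', hekad_process_count())
  for port in (4352, 8325, 5567, 4353):
      print(port, 'LISTEN' if port_is_listening(port) else 'NOT LISTENING')
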
@@ -13,6 +13,7 @@ Welcome to the StackLight Collector Documentation!

   licenses
   appendix_a
   appendix_b
   appendix_c

Indices and Tables
==================

@@ -3,76 +3,84 @@

Overview
========

The **StackLight Collector Plugin** is used to install and configure
several software components that are used to collect and process all the
data that we think is relevant to provide deep operational insights about
your OpenStack environment. These finely integrated components are
collectively referred to as the **StackLight Collector** (or just **the Collector**).

.. note:: The Collector has evolved over time and so the term
   'collector' is a little bit of a misnomer since it is
   more of a **smart monitoring agent** than a mere data 'collector'.

The Collector is a key component of the so-called
`Logging, Monitoring and Alerting toolchain of Mirantis OpenStack
<https://launchpad.net/lma-toolchain>`_ (a.k.a. StackLight).

.. image:: ../../images/toolchain_map.png
   :align: center

The Collector is installed on every node of your OpenStack
environment. Each Collector is individually responsible for supporting
all the monitoring functions of your OpenStack environment for both
the operating system and the services running on the node.
Note also that the Collector running on the *primary controller*
(the controller which owns the management VIP) is called the
**Aggregator** since it performs additional aggregation and correlation
functions. The Aggregator is the central point of convergence for
all the faults and anomalies detected at the node level. The
fundamental role of the Aggregator is to issue an opinion about the
health status of your OpenStack environment at the cluster
level. As such, the Collector may be viewed as a monitoring
agent for cloud infrastructure clusters.

A wealth of operational data is collected from a variety of sources including
log files, collectd and RabbitMQ for the OpenStack notifications.
The main building blocks of the Collector are:

* `collectd <https://collectd.org/>`_ which is bundled with a collection of
  monitoring plugins. Many of them are purpose-built for OpenStack.
* `Heka <https://github.com/mozilla-services/heka>`_ (a golang data processing
  *swiss army knife* by Mozilla) which is the cornerstone technology of the Collector.
  Heka supports out of the box a number of input and output plugins that allow
  the Collector to integrate with a number of external systems' native
  protocols like Elasticsearch, InfluxDB, Nagios, SMTP, Whisper, Kafka, AMQP and
  Carbon, to name a few.
* A collection of Heka plugins written in Lua to decode, process and encode the
  operational data.

.. note:: An important function of the Collector is to normalize
   the operational data into an internal `Heka message structure
   <https://hekad.readthedocs.io/en/stable/message/index.html>`_
   representation that can be ingested into Heka's stream processing
   pipeline. The stream processing pipeline uses matching policies to
   route the Heka messages to the `Lua <http://www.lua.org/>`_ plugins that
   will perform the actual data computation functions.
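
For illustration, a normalized message can be pictured as follows. The attribute
names below are the standard Heka message fields; the values are invented for the
example and are not taken from StackLight::

  # Illustrative normalized message using the standard Heka attributes.
  heka_message = {
      'Timestamp': 1465557820000000000,   # nanoseconds since the epoch
      'Logger': 'openstack.nova',         # origin of the data
      'Type': 'log',                      # matched on by the routing policies
      'Severity': 3,                      # syslog-like severity level
      'Payload': 'GET /v2/servers HTTP/1.1 status: 200',
      'Fields': {'http_status': 200, 'http_method': 'GET'},
  }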

There are three types of Lua plugins running in the *collector*:

* The input plugins which collect, sanitize and transform the raw
  data into an internal message representation which is injected into the
  Heka pipeline for further processing.
* The filter plugins which execute the analysis and correlation functions.
* The output plugins which encode and transmit the messages to external
  systems like Elasticsearch, InfluxDB or Nagios where the data can
  be further processed and persisted.
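
The resulting flow can be pictured as a toy pipeline. This is a hedged illustration
of the input → filter → output idea only; the real plugins are Lua code executed
inside *hekad*, and the functions below are invented for the example::

  import json

  def input_plugin(raw_line):
      """Normalize raw data into a message (see the sketch above)."""
      return {'Type': 'log', 'Payload': raw_line, 'Fields': {}}

  def filter_plugin(message):
      """A trivial analysis step: flag error lines for a downstream alarm."""
      message['Fields']['is_error'] = 'ERROR' in message['Payload']
      return message

  def output_plugin(message):
      """Encode the message for a backend; here, plain JSON."""
      return json.dumps(message)

  for line in ('INFO all good', 'ERROR something broke'):
      print(output_plugin(filter_plugin(input_plugin(line))))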

There are five types of data sent by the Collector (and the Aggregator)
to the backend servers:

* The logs and notifications which are sent to Elasticsearch for indexing.
  Elasticsearch combined with Kibana provides insightful log analytics.
* The metrics which are sent to InfluxDB.
  InfluxDB combined with Grafana provides insightful time-series analytics.
* The health status metrics for the OpenStack clusters which are sent to Nagios
  (or via SMTP) for alerting and escalation purposes.
* The annotation messages which are sent to InfluxDB. The annotation messages contain
  information about what caused a service cluster or node cluster to change state.
  The annotation messages provide root cause analysis hints whenever possible.
  The annotation messages are also used to construct the alert notifications that are
  sent via SMTP or to Nagios.

.. note:: The annotations are like notification messages
   which are exposed in Grafana. They contain information about the
   anomalies and faults that have been detected by the Collector.
   They basically contain the same information as the *passive checks*
   sent to Nagios. In addition, they may contain 'hints' about what
   the Collector thinks could be the root cause of a problem.

.. _plugin_requirements:
@@ -94,16 +102,9 @@ Requirements

Limitations
-----------

* The plugin is not compatible with an OpenStack environment deployed with nova-network.

* When you re-execute tasks on deployed nodes using the Fuel CLI, the *hekad* and
  *collectd* processes will be restarted on these nodes during the post-deployment
  phase. See `bug #1570850
  <https://bugs.launchpad.net/lma-toolchain/+bug/1570850>`_ for details.

@@ -12,27 +12,27 @@ Version 0.10.0

Prior to StackLight version 0.10.0, there was one instance of the *hekad*
process running to process both the logs and the metrics. Starting with StackLight
version 0.10.0, the processing of the logs and notifications is separated
from the processing of the metrics in two different *hekad* instances.
This allows for better performance and flow control when the
maximum buffer size on disk has reached a limit. With the *hekad* instance
processing the metrics, the buffering policy mandates to drop the metrics
when the maximum buffer size is reached. With the *hekad* instance
processing the logs, the buffering policy mandates to block the
entire processing pipeline. This way, we can avoid
losing logs (and notifications) when the Elasticsearch
server is inaccessible for a long period of time.
As a result, the StackLight Collector now has two processes running
on the node:

* One for the *log_collector* service
* One for the *metric_collector* service

* Metrics derived from logs are aggregated by the *log_collector* service.

  To avoid flooding the *metric_collector* with bursts of metrics derived
  from logs, the *log_collector* service sends metrics in bulk to the
  *metric_collector* service.
  An example of an aggregated metric derived from logs is
  `openstack_<service>_http_response_time_stats
  <http://fuel-plugin-lma-collector.readthedocs.io/en/latest/appendix_b.html#api-response-times>`_.

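The two buffering policies trade completeness against liveness: dropping keeps the
metrics pipeline moving at the cost of losing samples, while blocking pauses the logs
pipeline so that nothing is lost. Below is a minimal sketch of the idea in Python,
assuming a simple in-memory queue (StackLight actually relies on Heka's on-disk
buffering)::

  from collections import deque

  class Buffer:
      """Toy bounded buffer with a 'drop' or 'block' full-policy."""

      def __init__(self, max_size, full_action):
          self.queue = deque()
          self.max_size = max_size
          self.full_action = full_action  # 'drop' (metrics) or 'block' (logs)

      def push(self, item):
          if len(self.queue) < self.max_size:
              self.queue.append(item)
              return True
          if self.full_action == 'drop':
              return False  # discard the item, keep the pipeline moving
          # 'block': signal the caller to pause upstream processing instead
          raise BlockingIOError('buffer full, pipeline is paused')

  metrics = Buffer(max_size=2, full_action='drop')
  for sample in range(3):
      metrics.push(sample)  # the third sample is silently dropped
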
@@ -41,10 +41,10 @@ Version 0.10.0

A diagnostic tool is now available to help diagnose problems.
The diagnostic tool checks that StackLight is properly installed
and configured across the entire LMA toolchain. Please check the
`Diagnostic Tool
<http://fuel-plugin-lma-collector.readthedocs.io/en/latest/configuration.html#diagnostic>`_
section of the User Guide for more information.

* Bug fixes