Foundation for LMA docs
This begins building documentation for the LMA services included in openstack-helm-infra. This includes documentation for: kibana, elasticsearch, fluent-logging, grafana, prometheus, and nagios.

Change-Id: Iaa24be04748e76fabca998972398802e7e921ef1
Signed-off-by: Steve Wilkerson <wilkers.steve@gmail.com>
parent 1c87af7856
commit eab9ca05a6
@ -8,7 +8,9 @@ Contents:

install/index
testing/index
monitoring/index
logging/index
readme

Indices and Tables
==================
196
doc/source/logging/elasticsearch.rst
Normal file
@ -0,0 +1,196 @@
Elasticsearch
=============

The Elasticsearch chart in openstack-helm-infra provides a distributed data
store to index and analyze logs generated from the OpenStack-Helm services.
The chart contains templates for:

- Elasticsearch client nodes
- Elasticsearch data nodes
- Elasticsearch master nodes
- An Elasticsearch exporter for providing cluster metrics to Prometheus
- A cronjob for Elastic Curator to manage data indices

Authentication
--------------

The Elasticsearch deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Elasticsearch. The
username and password are configured under the Elasticsearch entry in the
endpoints section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Elasticsearch Service Configuration
-----------------------------------

The Elasticsearch service configuration file can be modified with a combination
of pod environment variables and entries in the values.yaml file. Elasticsearch
does not require much configuration out of the box, and the default values for
these configuration settings are meant to provide a highly available cluster by
default.

The vital entries in this configuration file are:

- path.data: The path at which to store the indexed data
- path.repo: The location of any snapshot repositories used to back up indexes
- bootstrap.memory_lock: Ensures none of the JVM is swapped to disk
- discovery.zen.minimum_master_nodes: Minimum required masters for the cluster

The bootstrap.memory_lock entry ensures none of the JVM will be swapped to disk
during execution, and setting this value to false will negatively affect the
health of your Elasticsearch nodes. The discovery.zen.minimum_master_nodes
setting defines the minimum number of masters required for your Elasticsearch
cluster to register as healthy and functional.

To read more about Elasticsearch's configuration file, please see the official
documentation_.

.. _documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html
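
As a rough illustration, the four entries listed above could appear in a
rendered Elasticsearch configuration file as shown below. The paths and the
master count are placeholder assumptions, not the chart's defaults. For a
cluster with three master-eligible nodes, a quorum of (3 / 2) + 1 = 2 is the
usual choice for discovery.zen.minimum_master_nodes.

::

    # Illustrative values only; the paths are assumed, not taken from the chart
    path:
      data: /usr/share/elasticsearch/data
      repo: /usr/share/elasticsearch/snapshots
    bootstrap:
      memory_lock: true
    discovery:
      zen:
        minimum_master_nodes: 2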

Elastic Curator
---------------

The Elasticsearch chart contains a cronjob to run Elastic Curator at specified
intervals to manage the lifecycle of your indices. Curator can:

- Take and send a snapshot of your indexes to a specified snapshot repository
- Delete indexes older than a specified length of time
- Restore indexes with previous index snapshots
- Reindex an index into a new or preexisting index

The full list of supported Curator actions can be found in the actions_ section of
the official Curator documentation. The list of options available for those
actions can be found in the options_ section of the Curator documentation.

.. _actions: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/actions.html
.. _options: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/options.html

Curator's configuration is handled via entries in Elasticsearch's values.yaml
file and must be overridden to achieve your index lifecycle management
needs. Please note that any unused field should be left blank, as an entry of
"None" will result in an exception, as Curator will read it as a Python NoneType
instead of a value of None.

The section for Curator's service configuration can be found at:

::

    conf:
      curator:
        config:
          client:
            hosts:
              - elasticsearch-logging
            port: 9200
            url_prefix:
            use_ssl: False
            certificate:
            client_cert:
            client_key:
            ssl_no_validate: False
            http_auth:
            timeout: 30
            master_only: False
          logging:
            loglevel: INFO
            logfile:
            logformat: default
            blacklist: ['elasticsearch', 'urllib3']

Curator's actions are configured in the following section:

::

    conf:
      curator:
        action_file:
          actions:
            1:
              action: delete_indices
              description: "Clean up ES by deleting old indices"
              options:
                timeout_override:
                continue_if_exception: False
                ignore_empty_list: True
                disable_action: True
              filters:
              - filtertype: age
                source: name
                direction: older
                timestring: '%Y.%m.%d'
                unit: days
                unit_count: 30
                field:
                stats_result:
                epoch:
                exclude: False

The Elasticsearch chart contains commented example actions for deleting and
snapshotting indexes older than 30 days. Please note these actions are provided
as a reference and are disabled by default to avoid any unexpected behavior
against your indexes.
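
A minimal sketch of an override that turns the delete action on, assuming it
is supplied through a separate values file passed to Helm. It reuses the keys
from the reference action and changes only disable_action and unit_count; the
14 day retention period is an arbitrary example. Unused fields are left blank
as described earlier, and the filters list is restated in full because Helm
replaces list values instead of merging them.

::

    conf:
      curator:
        action_file:
          actions:
            1:
              action: delete_indices
              description: "Delete indices older than 14 days"
              options:
                timeout_override:
                continue_if_exception: False
                ignore_empty_list: True
                # False enables the action
                disable_action: False
              filters:
              - filtertype: age
                source: name
                direction: older
                timestring: '%Y.%m.%d'
                unit: days
                # example retention period, not a chart default
                unit_count: 14
                field:
                stats_result:
                epoch:
                exclude: False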

Elasticsearch Exporter
----------------------

The Elasticsearch chart contains templates for an exporter to provide metrics
for Prometheus. These metrics provide insight into the performance and overall
health of your Elasticsearch cluster. Please note monitoring for Elasticsearch
is disabled by default, and must be enabled with the following override:

::

    monitoring:
      prometheus:
        enabled: true

The Elasticsearch exporter uses the same service annotations as the other
exporters, and no additional configuration is required for Prometheus to target
the Elasticsearch exporter for scraping. The Elasticsearch exporter is
configured with command line flags, and the flags' default values can be found
under the following key in the values.yaml file:

::

    conf:
      prometheus_elasticsearch_exporter:
        es:
          all: true
          timeout: 20s

The configuration keys configure the following behaviors:

- es.all: Gather information from all nodes, not just the connecting node
- es.timeout: Timeout for metrics queries
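
The two settings above can be combined in a single override file. The sketch
below enables the exporter and raises the query timeout; the 30s value is an
illustrative assumption, not a recommendation.

::

    monitoring:
      prometheus:
        enabled: true
    conf:
      prometheus_elasticsearch_exporter:
        es:
          all: true
          # example value; the chart default shown above is 20s
          timeout: 30s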

More information about the Elasticsearch exporter can be found on the exporter's
GitHub_ page.

.. _GitHub: https://github.com/justwatchcom/elasticsearch_exporter

Snapshot Repositories
---------------------

Before Curator can store snapshots in a specified repository, Elasticsearch must
register the configured repository. To achieve this, the Elasticsearch chart
contains a job for registering an S3 snapshot repository backed by the RADOS
Gateway (RGW). This job is disabled by default, as the Curator actions for
snapshots are disabled by default. To enable the snapshot job, the
conf.elasticsearch.snapshots.enabled flag must be set to true. The following
configuration keys are relevant:

- conf.elasticsearch.snapshots.enabled: Enable snapshot repositories
- conf.elasticsearch.snapshots.bucket: Name of the RGW S3 bucket to use
- conf.elasticsearch.snapshots.repositories: Name of repositories to create
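
A minimal sketch of an override enabling the snapshot job. The bucket name is
a hypothetical placeholder. The repositories key is left at the chart's
default because its exact structure is not shown in this document.

::

    conf:
      elasticsearch:
        snapshots:
          enabled: true
          # hypothetical bucket name, for illustration only
          bucket: elasticsearch_bucket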

More information about Elasticsearch repositories can be found in the official
Elasticsearch snapshot_ documentation.

.. _snapshot: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_repositories
279
doc/source/logging/fluent-logging.rst
Normal file
@ -0,0 +1,279 @@
Fluent-logging
==============

The fluent-logging chart in openstack-helm-infra provides the base for a
centralized logging platform for OpenStack-Helm. The chart combines two
services, Fluentbit and Fluentd, to gather logs generated by the services,
filter on or add metadata to logged events, then forward them to Elasticsearch
for indexing.

Fluentbit
---------

Fluentbit runs as a log-collecting component on each host in the cluster, and
can be configured to target specific log locations on the host. The Fluentbit_
configuration schema can be found on the official Fluentbit website.

.. _Fluentbit: http://fluentbit.io/documentation/0.12/configuration/schema.html

Fluentbit provides a set of plug-ins for ingesting and filtering various log
types. These plug-ins include:

- Tail: Tails a defined file for logged events
- Kube: Adds Kubernetes metadata to a logged event
- Systemd: Provides the ability to collect logs from the journald daemon
- Syslog: Provides the ability to collect logs from a Unix socket (TCP or UDP)

The complete list of plug-ins can be found in the configuration_ section of the
Fluentbit documentation.

.. _configuration: http://fluentbit.io/documentation/current/configuration/

Fluentbit uses parsers to turn unstructured log entries into structured entries
to make processing and filtering events easier. The two formats supported are
JSON maps and regular expressions. More information about Fluentbit's parsing
abilities can be found in the parsers_ section of Fluentbit's documentation.

.. _parsers: http://fluentbit.io/documentation/current/parser/

Fluentbit's service and parser configurations are defined via the values.yaml
file, which allows for custom definitions of inputs, filters and outputs for
your logging needs.
Fluentbit's configuration can be found under the following key:

::

    conf:
      fluentbit:
        - service:
            header: service
            Flush: 1
            Daemon: Off
            Log_Level: info
            Parsers_File: parsers.conf
        - containers_tail:
            header: input
            Name: tail
            Tag: kube.*
            Path: /var/log/containers/*.log
            Parser: docker
            DB: /var/log/flb_kube.db
            Mem_Buf_Limit: 5MB
        - kube_filter:
            header: filter
            Name: kubernetes
            Match: kube.*
            Merge_JSON_Log: On
        - fluentd_output:
            header: output
            Name: forward
            Match: "*"
            Host: ${FLUENTD_HOST}
            Port: ${FLUENTD_PORT}

Fluentbit is configured by default to capture logs at the info log level. To
change this, override the Log_Level key with the appropriate level. The
available levels are documented in Fluentbit's configuration_.
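
A minimal sketch of such an override, assuming it is supplied through a
separate values file passed to Helm. Because Helm replaces list values
wholesale instead of merging them, the override restates every item under
conf.fluentbit and changes only Log_Level. The debug level is used here as an
example.

::

    conf:
      fluentbit:
        - service:
            header: service
            Flush: 1
            Daemon: Off
            # only this value differs from the defaults shown above
            Log_Level: debug
            Parsers_File: parsers.conf
        - containers_tail:
            header: input
            Name: tail
            Tag: kube.*
            Path: /var/log/containers/*.log
            Parser: docker
            DB: /var/log/flb_kube.db
            Mem_Buf_Limit: 5MB
        - kube_filter:
            header: filter
            Name: kubernetes
            Match: kube.*
            Merge_JSON_Log: On
        - fluentd_output:
            header: output
            Name: forward
            Match: "*"
            Host: ${FLUENTD_HOST}
            Port: ${FLUENTD_PORT}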

Fluentbit's parser configuration can be found under the following key:

::

    conf:
      parsers:
        - docker:
            header: parser
            Name: docker
            Format: json
            Time_Key: time
            Time_Format: "%Y-%m-%dT%H:%M:%S.%L"
            Time_Keep: On

The values for the fluentbit and parsers keys are consumed by a fluent-logging
helper template that produces the appropriate configurations for the relevant
sections. Each list item (keys prefixed with a '-') represents a section in the
configuration files, and the arbitrary name of the list item should represent a
logical description of the section defined. The header key represents the type
of definition (filter, input, output, service or parser), and the remaining
entries will be rendered as space delimited configuration keys and values. For
example, the definitions above would result in the following:

::

    [SERVICE]
        Daemon false
        Flush 1
        Log_Level info
        Parsers_File parsers.conf
    [INPUT]
        DB /var/log/flb_kube.db
        Mem_Buf_Limit 5MB
        Name tail
        Parser docker
        Path /var/log/containers/*.log
        Tag kube.*
    [FILTER]
        Match kube.*
        Merge_JSON_Log true
        Name kubernetes
    [OUTPUT]
        Host ${FLUENTD_HOST}
        Match *
        Name forward
        Port ${FLUENTD_PORT}
    [PARSER]
        Format json
        Name docker
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep true
        Time_Key time
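
As a sketch of how a custom section could be added, the list item below
defines an additional input that uses Fluentbit's systemd plug-in. The item
name systemd_input and the journal.* tag are illustrative assumptions, not
chart defaults. Because Helm replaces list values wholesale, a real override
file would need to restate the existing items in place of the comment.

::

    conf:
      fluentbit:
        # ... existing service, input, filter and output items go here ...
        - systemd_input:
            header: input
            Name: systemd
            Tag: journal.*

Following the rendering rules described above, this item would be expected to
produce an additional [INPUT] section containing the Name and Tag entries.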

Fluentd
-------

Fluentd runs as a forwarding service that receives event entries from Fluentbit
and routes them to the appropriate destination. By default, Fluentd will route
all entries received from Fluentbit to Elasticsearch for indexing. The
Fluentd_ configuration schema can be found at the official Fluentd website.

.. _Fluentd: https://docs.fluentd.org/v0.12/articles/config-file

Fluentd's configuration is handled in the values.yaml file in fluent-logging.
Similar to Fluentbit, configuration overrides provide flexibility in defining
custom routes for tagged log events. The configuration can be found under the
following key:

::

    conf:
      fluentd:
        - fluentbit_forward:
            header: source
            type: forward
            port: "#{ENV['FLUENTD_PORT']}"
            bind: 0.0.0.0
        - elasticsearch:
            header: match
            type: elasticsearch
            expression: "**"
            include_tag_key: true
            host: "#{ENV['ELASTICSEARCH_HOST']}"
            port: "#{ENV['ELASTICSEARCH_PORT']}"
            logstash_format: true
            buffer_chunk_limit: 10M
            buffer_queue_limit: 32
            flush_interval: "20"
            max_retry_wait: 300
            disable_retry_limit: ""

The values for the fluentd keys are consumed by a fluent-logging helper template
that produces appropriate configurations for each directive desired. The list
items (keys prefixed with a '-') represent sections in the configuration file,
and the name of each list item should represent a logical description of the
section defined. The header key represents the directive type (source, match
or filter), the type key names the fluentd plug-in used, and the expression key
is used when the directive requires a pattern to match against (example:
matches on certain input patterns). The remaining entries will be rendered as
space delimited configuration keys and values. For example, the definition
above would result in the following:

::

    <source>
      bind 0.0.0.0
      port "#{ENV['FLUENTD_PORT']}"
      @type forward
    </source>
    <match **>
      buffer_chunk_limit 10M
      buffer_queue_limit 32
      disable_retry_limit
      flush_interval 20
      host "#{ENV['ELASTICSEARCH_HOST']}"
      include_tag_key true
      logstash_format true
      max_retry_wait 300
      port "#{ENV['ELASTICSEARCH_PORT']}"
      @type elasticsearch
    </match>
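
The sketch below adds a custom route for tagged events, assuming a
hypothetical debug.** tag pattern and using Fluentd's built-in stdout output
plug-in. Fluentd evaluates match directives in order and the first matching
directive handles the event, so the new item is placed before the catch-all
"**" match. The item name debug_stdout is arbitrary.

::

    conf:
      fluentd:
        - fluentbit_forward:
            header: source
            type: forward
            port: "#{ENV['FLUENTD_PORT']}"
            bind: 0.0.0.0
        # hypothetical route; must precede the catch-all match below
        - debug_stdout:
            header: match
            type: stdout
            expression: "debug.**"
        - elasticsearch:
            header: match
            type: elasticsearch
            expression: "**"
            include_tag_key: true
            host: "#{ENV['ELASTICSEARCH_HOST']}"
            port: "#{ENV['ELASTICSEARCH_PORT']}"
            logstash_format: true
            buffer_chunk_limit: 10M
            buffer_queue_limit: 32
            flush_interval: "20"
            max_retry_wait: 300
            disable_retry_limit: ""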

Some fluentd plug-ins require nested definitions. The fluentd helper template
can handle these definitions with the following structure:

::

    conf:
      fluentd:
        - fluentbit_forward:
            header: source
            type: forward
            port: "#{ENV['FLUENTD_PORT']}"
            bind: 0.0.0.0
        - log_transformer:
            header: filter
            type: record_transformer
            expression: "foo.bar"
            inner_def:
              - record_transformer:
                  header: record
                  hostname: my_host
                  tag: my_tag

In this example, the inner_def list will generate a nested configuration
entry in the log_transformer section. The nested definitions are handled by
supplying a list as the value for an arbitrary key, and the list value will
indicate the entry should be handled as a nested definition. The helper
template will render the above example key/value pairs as the following:

::

    <source>
      bind 0.0.0.0
      port "#{ENV['FLUENTD_PORT']}"
      @type forward
    </source>
    <filter foo.bar>
      <record>
        hostname my_host
        tag my_tag
      </record>
      @type record_transformer
    </filter>

Fluentd Exporter
----------------

The fluent-logging chart contains templates for an exporter to provide metrics
for Fluentd. These metrics provide insight into Fluentd's performance. Please
note monitoring for Fluentd is disabled by default, and must be enabled with the
following override:

::

    monitoring:
      prometheus:
        enabled: true

The Fluentd exporter uses the same service annotations as the other exporters,
and no additional configuration is required for Prometheus to target the
Fluentd exporter for scraping. The Fluentd exporter is configured with command
line flags, and the flags' default values can be found under the following key
in the values.yaml file:

::

    conf:
      fluentd_exporter:
        log:
          format: "logger:stdout?json=true"
          level: "info"

The configuration keys configure the following behaviors:

- log.format: Define the logger used and format of the output
- log.level: Log level for the exporter to use
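
Both settings can be combined in one override file. The sketch below enables
the exporter and raises its verbosity. The debug level is assumed to be
accepted by the exporter's logger and should be checked against the exporter's
own documentation.

::

    monitoring:
      prometheus:
        enabled: true
    conf:
      fluentd_exporter:
        log:
          format: "logger:stdout?json=true"
          # assumed to be a valid level; the default shown above is info
          level: "debug"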

More information about the Fluentd exporter can be found on the exporter's
GitHub_ page.

.. _GitHub: https://github.com/V3ckt0r/fluentd_exporter
11
doc/source/logging/index.rst
Normal file
@ -0,0 +1,11 @@
OpenStack-Helm Logging
======================

Contents:

.. toctree::
   :maxdepth: 2

   elasticsearch
   fluent-logging
   kibana
76
doc/source/logging/kibana.rst
Normal file
@ -0,0 +1,76 @@
Kibana
======

The Kibana chart in OpenStack-Helm Infra provides visualization for logs indexed
into Elasticsearch. These visualizations provide the means to view logs captured
from services deployed in the cluster and targeted for collection by Fluentbit.

Authentication
--------------

The Kibana deployment includes a sidecar container that runs an Apache reverse
proxy to add authentication capabilities for Kibana. The username and password
are configured under the Kibana entry in the endpoints section of the chart's
values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Configuration
-------------

Kibana's configuration is driven by the chart's values.yaml file. The configuration
options are found under the following keys:

::

    conf:
      elasticsearch:
        pingTimeout: 1500
        preserveHost: true
        requestTimeout: 30000
        shardTimeout: 0
        startupTimeout: 5000
      il8n:
        defaultLocale: en
      kibana:
        defaultAppId: discover
        index: .kibana
      logging:
        quiet: false
        silent: false
        verbose: false
      ops:
        interval: 5000
      server:
        host: localhost
        maxPayloadBytes: 1048576
        port: 5601
        ssl:
          enabled: false

The case of the sub-keys is important as these values are injected into
Kibana's configuration configmap with the toYaml function. More information on
the configuration options and available settings can be found in the official
Kibana documentation_.

.. _documentation: https://www.elastic.co/guide/en/kibana/current/settings.html
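
A minimal sketch of an override file that adjusts two of the settings above.
The values are illustrative assumptions, and the nesting is assumed to follow
the snippet shown in this section. Only the keys that change need to be
listed, since Helm merges map values with the chart defaults.

::

    conf:
      elasticsearch:
        # example: allow slower Elasticsearch responses (milliseconds)
        requestTimeout: 60000
      server:
        # example: double the default payload limit
        maxPayloadBytes: 2097152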

Installation
------------

.. code-block:: bash

    helm install --namespace=<namespace> local/kibana --name=kibana

Setting Time Field
------------------

For Kibana to successfully read the logs from Elasticsearch's indexes, the time
field will need to be manually set after Kibana has successfully deployed. Upon
visiting the Kibana dashboard for the first time, a prompt will appear to choose the
time field with a drop down menu. The default time field for Elasticsearch indexes
is '@timestamp'. Once this field is selected, the default view for querying log entries
can be found by selecting the "Discover"
89
doc/source/monitoring/grafana.rst
Normal file
@ -0,0 +1,89 @@
Grafana
=======

The Grafana chart in OpenStack-Helm Infra provides default dashboards for the
metrics gathered with Prometheus. The default dashboards include visualizations
for metrics on: Ceph, Kubernetes, nodes, containers, MySQL, RabbitMQ, and
OpenStack.

Configuration
-------------

Grafana
~~~~~~~

Grafana's configuration is driven by the chart's values.yaml file, and the
relevant configuration entries are under the following key:

::

    conf:
      grafana:
        paths:
        server:
        database:
        session:
        security:
        users:
        log:
        log.console:
        dashboards.json:
        grafana_net:

These keys correspond to sections in the grafana.ini configuration file, and the
to_ini helm-toolkit function will render these values into the appropriate
format in grafana.ini. The list of options for these keys can be found in the
official Grafana configuration_ documentation.

.. _configuration: http://docs.grafana.org/installation/configuration/
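
As an illustration of how options are placed under these section keys, the
sketch below sets entries in the users and log sections. The option names come
from Grafana's configuration reference; the chosen values are examples, not
the chart's defaults.

::

    conf:
      grafana:
        users:
          # would be written to the [users] section of grafana.ini
          allow_sign_up: false
        log:
          # would be written to the [log] section of grafana.ini
          mode: console
          level: info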

Prometheus Data Source
~~~~~~~~~~~~~~~~~~~~~~

Grafana requires configured data sources for gathering metrics for display in
its dashboards. The configuration options for data sources are found under the
following key in Grafana's values.yaml file:

::

    conf:
      provisioning:
        datasources:
          monitoring:
            name: prometheus
            type: prometheus
            access: proxy
            orgId: 1
            editable: true
            basicAuth: true

The Grafana chart will use the keys under each entry beneath
.conf.provisioning.datasources as inputs to a helper template that will render
the appropriate configuration for the data source. The key for each data source
(monitoring in the above example) should map to an entry in the endpoints
section in the chart's values.yaml, as the data source's URL and authentication
credentials will be populated by the values defined in the corresponding
endpoint.

.. _sources: http://docs.grafana.org/features/datasources/

Dashboards
~~~~~~~~~~

Grafana adds dashboards during installation with dashboards defined in YAML under
the following key:

::

    conf:
      dashboards:

These YAML definitions are transformed to JSON, added to Grafana's
configuration configmap, and mounted to the Grafana pods dynamically, allowing
for flexibility in defining and adding custom dashboards to Grafana. Dashboards
can be added by inserting a new key along with a YAML dashboard definition as
the value. Additional dashboards can be found by searching on Grafana's
dashboards page, or you can define your own. A JSON-to-YAML tool, such as
json2yaml_, will help transform any custom or new dashboards from JSON to YAML.

.. _json2yaml: https://www.json2yaml.com/
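
A skeletal sketch of adding a dashboard under a new key. The key name
my_dashboard and every field below are hypothetical placeholders. A real
definition would normally be exported from Grafana as JSON and converted to
YAML, and would carry many more fields.

::

    conf:
      dashboards:
        # hypothetical key and contents, for illustration only
        my_dashboard:
          title: My Dashboard
          timezone: browser
          editable: true
          rows: []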
11
doc/source/monitoring/index.rst
Normal file
@ -0,0 +1,11 @@
OpenStack-Helm Monitoring
=========================

Contents:

.. toctree::
   :maxdepth: 2

   grafana
   prometheus
   nagios
365
doc/source/monitoring/nagios.rst
Normal file
@ -0,0 +1,365 @@
|
|||||||
|
Nagios
|
||||||
|
======
|
||||||
|
|
||||||
|
The Nagios chart in openstack-helm-infra can be used to provide an alarming
|
||||||
|
service that's tightly coupled to an OpenStack-Helm deployment. The Nagios
|
||||||
|
chart uses a custom Nagios core image that includes plugins developed to query
|
||||||
|
Prometheus directly for scraped metrics and triggered alarms, query the Ceph
|
||||||
|
manager endpoints directly to determine the health of a Ceph cluster, and to
|
||||||
|
query Elasticsearch for logged events that meet certain criteria (experimental).
|
||||||
|
|
||||||
|
Authentication
|
||||||
|
--------------
|
||||||
|
|
||||||
|
The Nagios deployment includes a sidecar container that runs an Apache reverse
|
||||||
|
proxy to add authentication capabilities for Nagios. The username and password
|
||||||
|
are configured under the nagios entry in the endpoints section of the chart's
|
||||||
|
values.yaml.
|
||||||
|
|
||||||
|
The configuration for Apache can be found under the conf.httpd key, and uses a
|
||||||
|
helm-toolkit function that allows for including gotpl entries in the template
|
||||||
|
directly. This allows the use of other templates, like the endpoint lookup
|
||||||
|
function templates, directly in the configuration for Apache.
|
||||||
|
|
||||||
|
Image Plugins
|
||||||
|
-------------
|
||||||
|
|
||||||
|
The Nagios image used contains custom plugins that can be used for the defined
|
||||||
|
service check commands. These plugins include:
|
||||||
|
|
||||||
|
- check_prometheus_metric.py: Query Prometheus for a specific metric and value
|
||||||
|
- check_exporter_health_metric.sh: Nagios plugin to query prometheus exporter
|
||||||
|
- check_rest_get_api.py: Check REST API status
|
||||||
|
- check_update_prometheus_hosts.py: Queries Prometheus, updates Nagios config
|
||||||
|
- query_prometheus_alerts.py: Nagios plugin to query prometheus ALERTS metric
|
||||||
|
|
||||||
|
More information about the Nagios image and plugins can be found here_.
|
||||||
|
|
||||||
|
.. _here: https://github.com/att-comdev/nagios
|
||||||
|
|
||||||
|
|
||||||
|
Nagios Service Configuration
----------------------------

The Nagios service is configured via the following section in the chart's
values file:

::

    conf:
      nagios:
        nagios:
          log_file: /opt/nagios/var/log/nagios.log
          cfg_file:
            - /opt/nagios/etc/nagios_objects.cfg
            - /opt/nagios/etc/objects/commands.cfg
            - /opt/nagios/etc/objects/contacts.cfg
            - /opt/nagios/etc/objects/timeperiods.cfg
            - /opt/nagios/etc/objects/templates.cfg
            - /opt/nagios/etc/objects/prometheus_discovery_objects.cfg
          object_cache_file: /opt/nagios/var/objects.cache
          precached_object_file: /opt/nagios/var/objects.precache
          resource_file: /opt/nagios/etc/resource.cfg
          status_file: /opt/nagios/var/status.dat
          status_update_interval: 10
          nagios_user: nagios
          nagios_group: nagios
          check_external_commands: 1
          command_file: /opt/nagios/var/rw/nagios.cmd
          lock_file: /var/run/nagios.lock
          temp_file: /opt/nagios/var/nagios.tmp
          temp_path: /tmp
          event_broker_options: -1
          log_rotation_method: d
          log_archive_path: /opt/nagios/var/log/archives
          use_syslog: 1
          log_service_retries: 1
          log_host_retries: 1
          log_event_handlers: 1
          log_initial_states: 0
          log_current_states: 1
          log_external_commands: 1
          log_passive_checks: 1
          service_inter_check_delay_method: s
          max_service_check_spread: 30
          service_interleave_factor: s
          host_inter_check_delay_method: s
          max_host_check_spread: 30
          max_concurrent_checks: 60
          check_result_reaper_frequency: 10
          max_check_result_reaper_time: 30
          check_result_path: /opt/nagios/var/spool/checkresults
          max_check_result_file_age: 3600
          cached_host_check_horizon: 15
          cached_service_check_horizon: 15
          enable_predictive_host_dependency_checks: 1
          enable_predictive_service_dependency_checks: 1
          soft_state_dependencies: 0
          auto_reschedule_checks: 0
          auto_rescheduling_interval: 30
          auto_rescheduling_window: 180
          service_check_timeout: 60
          host_check_timeout: 60
          event_handler_timeout: 60
          notification_timeout: 60
          ocsp_timeout: 5
          perfdata_timeout: 5
          retain_state_information: 1
          state_retention_file: /opt/nagios/var/retention.dat
          retention_update_interval: 60
          use_retained_program_state: 1
          use_retained_scheduling_info: 1
          retained_host_attribute_mask: 0
          retained_service_attribute_mask: 0
          retained_process_host_attribute_mask: 0
          retained_process_service_attribute_mask: 0
          retained_contact_host_attribute_mask: 0
          retained_contact_service_attribute_mask: 0
          interval_length: 1
          check_workers: 4
          check_for_updates: 1
          bare_update_check: 0
          use_aggressive_host_checking: 0
          execute_service_checks: 1
          accept_passive_service_checks: 1
          execute_host_checks: 1
          accept_passive_host_checks: 1
          enable_notifications: 1
          enable_event_handlers: 1
          process_performance_data: 0
          obsess_over_services: 0
          obsess_over_hosts: 0
          translate_passive_host_checks: 0
          passive_host_checks_are_soft: 0
          check_for_orphaned_services: 1
          check_for_orphaned_hosts: 1
          check_service_freshness: 1
          service_freshness_check_interval: 60
          check_host_freshness: 0
          host_freshness_check_interval: 60
          additional_freshness_latency: 15
          enable_flap_detection: 1
          low_service_flap_threshold: 5.0
          high_service_flap_threshold: 20.0
          low_host_flap_threshold: 5.0
          high_host_flap_threshold: 20.0
          date_format: us
          use_regexp_matching: 1
          use_true_regexp_matching: 0
          daemon_dumps_core: 0
          use_large_installation_tweaks: 0
          enable_environment_macros: 0
          debug_level: 0
          debug_verbosity: 1
          debug_file: /opt/nagios/var/nagios.debug
          max_debug_file_size: 1000000
          allow_empty_hostgroup_assignment: 1
          illegal_macro_output_chars: "`~$&|'<>\""

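Because Helm merges nested maps from override files into the chart's default
values, an override only needs to name the keys that change. The fragment below
is a minimal sketch, assuming a hypothetical override file named
nagios-overrides.yaml that is passed to helm with the --values flag. The two
settings changed here were picked purely for illustration:

.. code-block:: yaml

    # nagios-overrides.yaml (hypothetical file name)
    conf:
      nagios:
        nagios:
          # Refresh the status file every 30 seconds instead of the default 10.
          status_update_interval: 30
          # Stop mirroring Nagios log entries to syslog.
          use_syslog: 0
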
Nagios CGI Configuration
------------------------

The Nagios CGI configuration is defined via the following section in the chart's
values file:

::

    conf:
      nagios:
        cgi:
          main_config_file: /opt/nagios/etc/nagios.cfg
          physical_html_path: /opt/nagios/share
          url_html_path: /nagios
          show_context_help: 0
          use_pending_states: 1
          use_authentication: 0
          use_ssl_authentication: 0
          authorized_for_system_information: "*"
          authorized_for_configuration_information: "*"
          authorized_for_system_commands: nagiosadmin
          authorized_for_all_services: "*"
          authorized_for_all_hosts: "*"
          authorized_for_all_service_commands: "*"
          authorized_for_all_host_commands: "*"
          default_statuswrl_layout: 4
          ping_syntax: /bin/ping -n -U -c 5 $HOSTADDRESS$
          refresh_rate: 90
          result_limit: 100
          escape_html_tags: 1
          action_url_target: _blank
          notes_url_target: _blank
          lock_author_names: 1
          navbar_search_for_addresses: 1
          navbar_search_for_aliases: 1

Nagios Host Configuration
-------------------------

The Nagios chart includes a single host definition for the Prometheus instance
queried for metrics. The host definition can be found under the following
values key:

::

    conf:
      nagios:
        hosts:
          - prometheus:
              use: linux-server
              host_name: prometheus
              alias: "Prometheus Monitoring"
              address: 127.0.0.1
              hostgroups: prometheus-hosts
              check_command: check-prometheus-host-alive

The address for the Prometheus host is defined by the PROMETHEUS_SERVICE
environment variable in the deployment template, which is determined by the
monitoring entry in the Nagios chart's endpoints section. The endpoint is then
available as a macro for Nagios to use in all Prometheus-based queries. For
example:

::

    - check_prometheus_host_alive:
        command_name: check-prometheus-host-alive
        command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The $USER2$ macro above corresponds to the Prometheus endpoint defined in the
PROMETHEUS_SERVICE environment variable. All checks that use the
prometheus-hosts hostgroup will map back to the Prometheus host defined by this
endpoint.

Nagios HostGroup Configuration
------------------------------

The Nagios chart includes configuration values for defined host groups under the
following values key:

::

    conf:
      nagios:
        host_groups:
          - prometheus-hosts:
              hostgroup_name: prometheus-hosts
              alias: "Prometheus Virtual Host"
          - base-os:
              hostgroup_name: base-os
              alias: "base-os"

These hostgroups are used to define which group of hosts should be targeted by
a particular Nagios check. An example of a check that targets Prometheus for a
specific metric query would be:

::

    - check_ceph_monitor_quorum:
        use: notifying_service
        hostgroup_name: prometheus-hosts
        service_description: "CEPH_quorum"
        check_command: check_prom_alert!ceph_monitor_quorum_low!CRITICAL- ceph monitor quorum does not exist!OK- ceph monitor quorum exists
        check_interval: 60

An example of a check that targets all hosts for a base-os type check (memory
usage, latency, etc.) would be:

::

    - check_memory_usage:
        use: notifying_service
        service_description: Memory_usage
        check_command: check_memory_usage
        hostgroup_name: base-os

These two host groups allow for a wide range of targeted checks for determining
the status of all components of an OpenStack-Helm deployment.

Nagios Command Configuration
----------------------------

The Nagios chart includes configuration values for the command definitions Nagios
will use when executing service checks. These values are found under the
following key:

::

    conf:
      nagios:
        commands:
          - send_service_snmp_trap:
              command_name: send_service_snmp_trap
              command_line: "$USER1$/send_service_trap.sh '$USER8$' '$HOSTNAME$' '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$' '$USER4$' '$USER5$'"
          - send_host_snmp_trap:
              command_name: send_host_snmp_trap
              command_line: "$USER1$/send_host_trap.sh '$USER8$' '$HOSTNAME$' $HOSTSTATEID$ '$HOSTOUTPUT$' '$USER4$' '$USER5$'"
          - send_service_http_post:
              command_name: send_service_http_post
              command_line: "$USER1$/send_http_post_event.py --type service --hostname '$HOSTNAME$' --servicedesc '$SERVICEDESC$' --state_id $SERVICESTATEID$ --output '$SERVICEOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
          - send_host_http_post:
              command_name: send_host_http_post
              command_line: "$USER1$/send_http_post_event.py --type host --hostname '$HOSTNAME$' --state_id $HOSTSTATEID$ --output '$HOSTOUTPUT$' --monitoring_hostname '$HOSTNAME$' --primary_url '$USER6$' --secondary_url '$USER7$'"
          - check_prometheus_host_alive:
              command_name: check-prometheus-host-alive
              command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"

The list of defined commands can be modified with configuration overrides, which
makes it possible to define commands specific to an infrastructure deployment.
These commands can include querying Prometheus for metrics on dependencies for a
service to determine whether an alert should be raised, executing checks on each
host to determine network latency or file system usage, or checking each node
for issues with NTP clock skew.

Note: Since the conf.nagios.commands key contains a list of the defined commands,
the entire contents of conf.nagios.commands will need to be overridden if
additional commands are desired, because Helm replaces lists in values overrides
rather than merging them.

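As a minimal sketch of such an override, the fragment below replaces the whole
conf.nagios.commands list in order to add one command. The
check_prometheus_host_alive_strict entry, its command name, and its thresholds
are hypothetical and chosen only for illustration; it reuses the
check_rest_get_api.py flags shown in the default command. Every default command
that should remain available has to be copied into the list as well:

.. code-block:: yaml

    conf:
      nagios:
        commands:
          # Default command, copied unchanged so it is not lost when the
          # list is replaced. Copy the remaining defaults the same way.
          - check_prometheus_host_alive:
              command_name: check-prometheus-host-alive
              command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 5 --critical_response_seconds 10"
          # Hypothetical additional command with tighter thresholds.
          - check_prometheus_host_alive_strict:
              command_name: check-prometheus-host-alive-strict
              command_line: "$USER1$/check_rest_get_api.py --url $USER2$ --warning_response_seconds 2 --critical_response_seconds 5"
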
Nagios Service Check Configuration
----------------------------------

The Nagios chart includes configuration values for the service checks Nagios
will execute. These service check commands can be found under the following
key:

::

    conf:
      nagios:
        services:
          - notifying_service:
              name: notifying_service
              use: generic-service
              flap_detection_enabled: 0
              process_perf_data: 0
              contact_groups: snmp_and_http_notifying_contact_group
              check_interval: 60
              notification_interval: 120
              retry_interval: 30
              register: 0
          - check_ceph_health:
              use: notifying_service
              hostgroup_name: base-os
              service_description: "CEPH_health"
              check_command: check_ceph_health
              check_interval: 300
          - check_hosts_health:
              use: generic-service
              hostgroup_name: prometheus-hosts
              service_description: "Nodes_health"
              check_command: check_prom_alert!K8SNodesNotReady!CRITICAL- One or more nodes are not ready.
              check_interval: 60
          - check_prometheus_replicas:
              use: notifying_service
              hostgroup_name: prometheus-hosts
              service_description: "Prometheus_replica-count"
              check_command: check_prom_alert_with_labels!replicas_unavailable_statefulset!statefulset="prometheus"!statefulset {statefulset} has lesser than configured replicas
              check_interval: 60

The Nagios service configurations define the checks Nagios will perform. These
checks contain keys for defining: the service type to use, the host group to
target, the description of the service check, the command the check should use,
and the interval at which to trigger the service check. These services can also
be extended to provide additional insight into the overall status of a
particular service, and they allow advanced checks to be defined for determining
the overall health and liveness of a service. For example, a service check could
trigger an alarm for the OpenStack services when Nagios detects that the
relevant database and message queue have become unresponsive.

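A minimal sketch of adding a service check through an override follows. The
check_prometheus_api entry and its service description are hypothetical; the
check reuses the check-prometheus-host-alive command and the notifying_service
template from the chart defaults. Because conf.nagios.services is a list, Helm
replaces it as a whole, so any default checks that should remain need to be
copied into the override too:

.. code-block:: yaml

    conf:
      nagios:
        services:
          # Template from the chart defaults, copied because the list is
          # replaced as a whole. Copy the other default checks the same way.
          - notifying_service:
              name: notifying_service
              use: generic-service
              flap_detection_enabled: 0
              process_perf_data: 0
              contact_groups: snmp_and_http_notifying_contact_group
              check_interval: 60
              notification_interval: 120
              retry_interval: 30
              register: 0
          # Hypothetical check that probes the Prometheus endpoint itself.
          - check_prometheus_api:
              use: notifying_service
              hostgroup_name: prometheus-hosts
              service_description: "Prometheus_api"
              check_command: check-prometheus-host-alive
              check_interval: 60
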
338
doc/source/monitoring/prometheus.rst
Normal file
@ -0,0 +1,338 @@
Prometheus
==========

The Prometheus chart in openstack-helm-infra provides a time series database and
a powerful query language for monitoring various components of OpenStack-Helm.
Prometheus gathers metrics by scraping defined service endpoints or pods at
specified intervals and indexing them in the underlying time series database.

Authentication
--------------

The Prometheus deployment includes a sidecar container that runs an Apache
reverse proxy to add authentication capabilities for Prometheus. The
username and password are configured under the monitoring entry in the endpoints
section of the chart's values.yaml.

The configuration for Apache can be found under the conf.httpd key, and uses a
helm-toolkit function that allows for including gotpl entries in the template
directly. This allows the use of other templates, like the endpoint lookup
function templates, directly in the configuration for Apache.

Prometheus Service Configuration
--------------------------------

The Prometheus service is configured via command line flags set during runtime.
These flags include: setting the configuration file, setting log levels, setting
characteristics of the time series database, and enabling the web admin API for
snapshot support. These settings can be configured via the values tree at:

::

    conf:
      prometheus:
        command_line_flags:
          log.level: info
          query.max_concurrency: 20
          query.timeout: 2m
          storage.tsdb.path: /var/lib/prometheus/data
          storage.tsdb.retention: 7d
          web.enable_admin_api: false
          web.enable_lifecycle: false

The Prometheus configuration file contains the definitions for scrape targets
and the location of the rules files for triggering alerts on scraped metrics.
The configuration file is defined in the values file, and can be found at:

::

    conf:
      prometheus:
        scrape_configs: |

By defining the configuration via the values file, an operator can override all
configuration components of the Prometheus deployment at runtime.

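As a minimal sketch, an override file that changes two of the command line flags
listed above could look like the following. The chosen values are illustrative
only, and flags that are not named keep the chart defaults:

.. code-block:: yaml

    conf:
      prometheus:
        command_line_flags:
          # Keep 30 days of samples instead of the default 7d.
          storage.tsdb.retention: 30d
          # Emit debug-level logs while troubleshooting.
          log.level: debug
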
Kubernetes Endpoint Configuration
---------------------------------

The Prometheus chart in openstack-helm-infra uses the built-in service discovery
mechanisms for Kubernetes endpoints and pods to automatically configure scrape
targets. Functions added to helm-toolkit allow configuration of these targets
via annotations that can be applied to any service or pod that exposes metrics
for Prometheus, whether a service for an application-specific exporter or an
application that provides a metrics endpoint via its service. The values in
these functions correspond to entries in the monitoring tree under the
prometheus key in a chart's values.yaml file.

The function definitions are below:

::

    {{- define "helm-toolkit.snippets.prometheus_service_annotations" -}}
    {{- $config := index . 0 -}}
    {{- if $config.scrape }}
    prometheus.io/scrape: {{ $config.scrape | quote }}
    {{- end }}
    {{- if $config.scheme }}
    prometheus.io/scheme: {{ $config.scheme | quote }}
    {{- end }}
    {{- if $config.path }}
    prometheus.io/path: {{ $config.path | quote }}
    {{- end }}
    {{- if $config.port }}
    prometheus.io/port: {{ $config.port | quote }}
    {{- end }}
    {{- end -}}

::

    {{- define "helm-toolkit.snippets.prometheus_pod_annotations" -}}
    {{- $config := index . 0 -}}
    {{- if $config.scrape }}
    prometheus.io/scrape: {{ $config.scrape | quote }}
    {{- end }}
    {{- if $config.path }}
    prometheus.io/path: {{ $config.path | quote }}
    {{- end }}
    {{- if $config.port }}
    prometheus.io/port: {{ $config.port | quote }}
    {{- end }}
    {{- end -}}

These functions render the following annotations:

- prometheus.io/scrape: Must be set to true for Prometheus to scrape target
- prometheus.io/scheme: Overrides scheme used to scrape target if not http
- prometheus.io/path: Overrides path used to scrape target metrics if not /metrics
- prometheus.io/port: Overrides port to scrape metrics on if not service's default port

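To illustrate the rendering, assume a chart passes a hypothetical configuration
with scrape set to true, scheme set to https, and port set to 9100 to the
service annotation function (these values are made up for this example). Since
no path is set, the path annotation is skipped, and the rendered annotations
would be equivalent to:

.. code-block:: yaml

    # Rendered from the hypothetical input scrape: true, scheme: https,
    # port: 9100. Each value passes through the quote filter, so it is
    # emitted as a string.
    prometheus.io/scrape: "true"
    prometheus.io/scheme: "https"
    prometheus.io/port: "9100"
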
Each chart that can be targeted for monitoring by Prometheus has a prometheus
section under a monitoring tree in the chart's values.yaml, and Prometheus
monitoring is disabled by default for those services. Example values for the
required entries can be found in the following monitoring configuration for the
prometheus-node-exporter chart:

::

    monitoring:
      prometheus:
        enabled: false
        node_exporter:
          scrape: true

If the prometheus.enabled key is set to true, the condition guarding the
annotations evaluates to true and the annotations are set on the targeted
service or pod. For example:

::

    {{- $prometheus_annotations := $envAll.Values.monitoring.prometheus.node_exporter }}
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: {{ tuple "node_metrics" "internal" . | include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }}
      labels:
    {{ tuple $envAll "node_exporter" "metrics" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }}
      annotations:
    {{- if .Values.monitoring.prometheus.enabled }}
    {{ tuple $prometheus_annotations | include "helm-toolkit.snippets.prometheus_service_annotations" | indent 4 }}
    {{- end }}

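A minimal sketch of an override that turns monitoring on for such a chart is
shown below. It only flips the enabled key; the scrape setting keeps the default
from the chart's values.yaml:

.. code-block:: yaml

    monitoring:
      prometheus:
        # Render the Prometheus annotations on the chart's service.
        enabled: true
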
Kubelet, API Server, and cAdvisor
---------------------------------

The Prometheus chart includes scrape target configurations for the kubelet, the
Kubernetes API servers, and cAdvisor. These targets are configured based on
a kubeadm deployed Kubernetes cluster, as OpenStack-Helm uses kubeadm to deploy
Kubernetes in the gates. These configurations may need to change based on your
chosen method of deployment. Please note the cAdvisor metrics will not be
captured if the kubelet was started with the following flag:

::

    --cadvisor-port=0

To enable the gathering of the kubelet's custom metrics, the following flag must
be set:

::

    --enable-custom-metrics

Installation
------------

The Prometheus chart can be installed with the following command:

.. code-block:: bash

    helm install --namespace=openstack local/prometheus --name=prometheus

The above command results in a Prometheus deployment configured to automatically
discover services with the necessary annotations for scraping, and to gather
metrics on the kubelet, the Kubernetes API servers, and cAdvisor.

Extending Prometheus
--------------------

Prometheus can target various exporters to gather metrics related to specific
applications to extend visibility into an OpenStack-Helm deployment. Currently,
openstack-helm-infra contains charts for:

- prometheus-kube-state-metrics: Provides additional Kubernetes metrics
- prometheus-node-exporter: Provides metrics for nodes and Linux kernels
- prometheus-openstack-exporter: Provides metrics for OpenStack services

Kube-State-Metrics
~~~~~~~~~~~~~~~~~~

The prometheus-kube-state-metrics chart provides metrics for Kubernetes objects
as well as metrics for kube-scheduler and kube-controller-manager. Information
on the specific metrics available via the kube-state-metrics service can be
found in the kube-state-metrics_ documentation.

The prometheus-kube-state-metrics chart can be installed with the following:

.. code-block:: bash

    helm install --namespace=kube-system local/prometheus-kube-state-metrics --name=prometheus-kube-state-metrics

.. _kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/tree/master/Documentation

Node Exporter
~~~~~~~~~~~~~

The prometheus-node-exporter chart provides hardware and operating system metrics
exposed via Linux kernels. Information on the specific metrics available via
the Node exporter can be found on the Node_exporter_ GitHub page.

The prometheus-node-exporter chart can be installed with the following:

.. code-block:: bash

    helm install --namespace=kube-system local/prometheus-node-exporter --name=prometheus-node-exporter

.. _Node_exporter: https://github.com/prometheus/node_exporter

OpenStack Exporter
~~~~~~~~~~~~~~~~~~

The prometheus-openstack-exporter chart provides metrics specific to the
OpenStack services. The exporter's source code can be found here_. While the
metrics provided are by no means comprehensive, they will be expanded upon.

Please note the OpenStack exporter requires the creation of a Keystone user to
successfully gather metrics. To create the required user, the chart uses the
same keystone user management job the OpenStack service charts use.

The prometheus-openstack-exporter chart can be installed with the following:

.. code-block:: bash

    helm install --namespace=openstack local/prometheus-openstack-exporter --name=prometheus-openstack-exporter

.. _here: https://github.com/att-comdev/openstack-metrics-collector

Other exporters
~~~~~~~~~~~~~~~

Certain charts in OpenStack-Helm include templates for application-specific
Prometheus exporters, which keeps the monitoring of those services tightly coupled
to the chart. The templates for these exporters can be found in the monitoring
subdirectory in the chart. These exporters are disabled by default, and can be
enabled by setting the appropriate flag in the monitoring.prometheus key of the
chart's values.yaml file. The charts containing exporters include:

- Elasticsearch_
- RabbitMQ_
- MariaDB_
- Memcached_
- Fluentd_
- Postgres_

.. _Elasticsearch: https://github.com/justwatchcom/elasticsearch_exporter
.. _RabbitMQ: https://github.com/kbudde/rabbitmq_exporter
.. _MariaDB: https://github.com/prometheus/mysqld_exporter
.. _Memcached: https://github.com/prometheus/memcached_exporter
.. _Fluentd: https://github.com/V3ckt0r/fluentd_exporter
.. _Postgres: https://github.com/wrouesnel/postgres_exporter

Ceph
~~~~

Starting with Luminous, Ceph can export metrics with the ceph-mgr prometheus
module. This module can be enabled in Ceph's values.yaml under the
ceph_mgr_enabled_plugins key by appending prometheus to the list of enabled
modules. After enabling the prometheus module, metrics can be scraped on the
ceph-mgr service endpoint. This relies on the Prometheus annotations attached to
the ceph-mgr service template, and these annotations can be modified in the
endpoints section of Ceph's values.yaml file. Information on the specific
metrics available via the prometheus module can be found in the Ceph
prometheus_ module documentation.

.. _prometheus: http://docs.ceph.com/docs/master/mgr/prometheus/

Prometheus Dashboard
--------------------

Prometheus includes a dashboard that can be accessed via the exposed
Prometheus endpoint (NodePort or otherwise). This dashboard will give you a
view of your scrape targets' state, the configuration values for Prometheus's
scrape jobs and command line flags, a view of any alerts triggered based on the
defined rules, and a means for using PromQL to query scraped metrics. The
Prometheus dashboard is a useful tool for verifying that Prometheus is
configured appropriately and for checking the status of any services targeted
for scraping via the Prometheus service discovery annotations.

Rules Configuration
-------------------

Prometheus provides a query language that can operate on defined rules, which
allow for the generation of alerts on specific metrics. The Prometheus chart in
openstack-helm-infra defines these rules via the values.yaml file. Defining
them in the values file gives operators the flexibility to provide specific
rules via overrides at installation. The following rules keys are provided:

::

    values:
      conf:
        rules:
          alertmanager:
          etcd3:
          kube_apiserver:
          kube_controller_manager:
          kubelet:
          kubernetes:
          rabbitmq:
          mysql:
          ceph:
          openstack:
          custom:

These keys provide recording and alert rules for all infrastructure
components of an OpenStack-Helm deployment. If you wish to exclude rules for a
component, leave the tree empty in an overrides file. To read more
about Prometheus recording and alert rules definitions, please see the official
Prometheus recording_ and alert_ rules documentation.

.. _recording: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
.. _alert: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Note: Prometheus releases prior to 2.0 used gotpl to define rules. Prometheus
2.0 changed the rules format to YAML, making them much easier to read. The
Prometheus chart in openstack-helm-infra uses Prometheus 2.0 by default to take
advantage of changes to the underlying storage layer and the handling of stale
data. The chart will not support overrides for Prometheus versions below 2.0,
as the command line flags for the service changed between versions.

The wide range of exporters included in OpenStack-Helm coupled with the ability
to define rules with configuration overrides allows for the addition of custom
alerting and recording rules to fit an operator's monitoring needs. Adding new
rules or modifying existing rules requires overrides for either an existing key
under conf.rules or the addition of a new key under conf.rules. The addition
of custom rules can be used to define complex checks that can be extended for
determining the liveness or health of infrastructure components.

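A minimal sketch of a custom rule override in the Prometheus 2.0 YAML format
follows. It assumes that each key under conf.rules holds the body of a standard
Prometheus rules file (a groups list), which should be verified against the
chart's values.yaml before relying on it. The group name, alert name, and
threshold are hypothetical:

.. code-block:: yaml

    conf:
      rules:
        custom:
          groups:
            - name: example.rules
              rules:
                # Hypothetical alert: fires when any scrape target has
                # been unreachable for five minutes.
                - alert: ExampleTargetDown
                  expr: up == 0
                  for: 5m
                  labels:
                    severity: warning
                  annotations:
                    summary: "A scrape target has been down for 5 minutes"
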
1
doc/source/readme.rst
Normal file
@ -0,0 +1 @@
.. include:: ../../README.rst