Merge "Add general Logging/Monitoring/Alerting spec"
This commit is contained in:
commit
1a26a343a8
|
@ -0,0 +1,297 @@
|
|||
==========================================================
|
||||
General Logging, Monitoring, Alerting architecture for MCP
|
||||
==========================================================
|
||||
|
||||
[No Jira Epic for this spec]
|
||||
|
||||
This specification describes the general Logging, Monitoring, Alerting
|
||||
architecture for MCP (Mirantis Cloud Platform).
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
Logging, Monitoring and Alerting are key aspects which need to be taken into
|
||||
account from the very beginning of the MCP project.
|
||||
|
||||
This specification just describes the general architecture for Logging,
|
||||
Monitoring and Alerting. Details on the different parts will be provided with
|
||||
more specific specifications.
|
||||
|
||||
In the rest of the document we will use LMA to refer to Logging, Monitoring and
|
||||
Alerting.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
The final goal is to provide tools to help OpenStack Operator diagnose and
|
||||
troubleshoot problems.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
We propose to add LMA components to MCP. The proposed software and architecture
|
||||
are based on the current Fuel StackLight product (composed of four Fuel
|
||||
plugins), with adjustements and improvements to meet the requirement of MCP
|
||||
(Mirantis Cloud Platform).
|
||||
|
||||
General Architecture
|
||||
--------------------
|
||||
|
||||
The following diagram describes the general architecture::
|
||||
|
||||
OpenStack nodes
|
||||
+-------------------+
|
||||
| +-------------------+
|
||||
| | +----+ |
|
||||
| | Logs+-+ +-+Snap| | +-------------+
|
||||
| | | | +----+ | | |
|
||||
| | +v--v+ | +------>Elasticsearch|
|
||||
| | |Heka+--------------+ | |
|
||||
+-+ +----+ | | +-------------+
|
||||
+-------------------+ |
|
||||
| +-------------+
|
||||
| | |
|
||||
+------+InfluxDB |
|
||||
k8s master node | | |
|
||||
+-------------------+ | +-------------+
|
||||
| +-------------------+ |
|
||||
| | +----+ | | +-------------+
|
||||
| | Logs+-+ +-+Snap| | +--+ | |
|
||||
| | | | +----+ | | +------>Nagios |
|
||||
| | +v--v+ | | | |
|
||||
| | |Heka+-----------+ +-------------+
|
||||
+-+ +----+ |
|
||||
+-------------------+
|
||||
|
||||
|
||||
The boxes on the top-left corner of the diagram represent the nodes where the
|
||||
OpenStack services run. The boxes on the bottom-left corner of the diagram
|
||||
represent the the nodes where the Kubernetes infrastructure services run. The
|
||||
boxes on the right of the diagram represent the nodes where the LMA backends
|
||||
are run.
|
||||
|
||||
Each node runs two services: Heka and Snap. Although it is not depicted in the
|
||||
diagram Heka and Snap also run on the backend nodes, where we also want to
|
||||
collect logs and telemetry data.
|
||||
|
||||
`Snap`_ is the telemetry framework created by Intel that we will use in MCP for
|
||||
collecting telemetry data (CPU usage, etc.). The current StackLight product
|
||||
uses Collectd instead of Snap, so this is an area where StackLight and MCP will
|
||||
differ. The telemetry data collected by Snap will be sent to Heka.
|
||||
|
||||
`Heka`_ is a stream processing software created and maintained by Mozilla. We
|
||||
will use Heka for collecting logs and notifications, deriving new metrics from
|
||||
the telemetry data received from Snap, and sending the results to
|
||||
Elasticsearch, InfluxDB and Nagios.
|
||||
|
||||
`Elasticsearch`_ will be used for indexing logs and notifications. And
|
||||
`Kibana`_ will be used for visualizing the data indexed in Elasticsearch.
|
||||
Default Kibana dashboards will be shipped in MCP.
|
||||
|
||||
`InfluxDB`_ is a database optimized for time-series. It will be used for
|
||||
storing the telemetry data. And `Grafana`_ will be used for visualizing the
|
||||
telemetry data stored in InfluxDB. Default Grafana dashboards will be shipped
|
||||
in MCP.
|
||||
|
||||
`Nagios`_ is a feature-full monitoring software. In MCP we may use it for
|
||||
handling status messages sent by Heka and reporting on the current status of
|
||||
nodes and services. For that Nagios's `Passive Checks`_ would be used. We've
|
||||
been looking at alternatives such as `Sensu`_ and `Icinga`_ , but until now we
|
||||
haven't found something with the level of functionality of Nagios. Another
|
||||
alternative is to just rely on Heka's `Alert module`_ and `SMTPOutput plugin`_
|
||||
for notifications. Whether Nagios will be used or not in MCP will be discussed
|
||||
with a more specific specification. It is also to be noted that Alerting should
|
||||
be an optional part of the monitoring sytem in MCP.
|
||||
|
||||
.. _Snap: https://github.com/intelsdi-x/snap
|
||||
.. _Heka: http://hekad.readthedocs.org/
|
||||
.. _Elasticsearch: https://www.elastic.co/products/elasticsearch
|
||||
.. _Kibana: https://www.elastic.co/products/kibana
|
||||
.. _InfluxDB: https://influxdata.com/
|
||||
.. _Grafana: http://grafana.org/
|
||||
.. _Nagios: https://www.nagios.org/
|
||||
.. _Passive Checks: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/passivechecks.html
|
||||
.. _Sensu: https://sensuapp.org/
|
||||
.. _Icinga: https://www.icinga.org/
|
||||
.. _Alert module: http://hekad.readthedocs.io/en/latest/sandbox/index.html#alert-module
|
||||
.. _SMTPOutput plugin: http://hekad.readthedocs.io/en/latest/config/outputs/smtp.html
|
||||
|
||||
Kubernetes Logging and Monitoring
|
||||
---------------------------------
|
||||
|
||||
Kubernetes comes with its own monitoring and logging stack, so the question of
|
||||
what we will use and not use of this stack should be raised. This section
|
||||
discusses that.
|
||||
|
||||
Monitoring
|
||||
~~~~~~~~~~
|
||||
|
||||
`Kubernetes Monitoring`_ uses `Heapster`_. Heapster runs as a pod on a node of
|
||||
the Kubernetes cluster. Heapster gets container statistics by querying the
|
||||
cluster nodes' Kubelets. The Kubelet itself fetches the data from cAdvisor.
|
||||
Heapster groups the information by pod and sends the data to a backend for
|
||||
storage and visualization (InfluxDB is supported).
|
||||
|
||||
Collecting container and pod statistics is necessary for MCP, but it's not
|
||||
sufficient. For example, we also want to collect OpenStack services, to be able
|
||||
to report on the health of the OpenStack services that run on the cluster.
|
||||
Also, Heapster does not currently support any form of alerting.
|
||||
|
||||
The proposal is to use Snap en each node (see the previous section). Snap
|
||||
already includes `plugins for OpenStack`_. For container statistics the `Docker
|
||||
plugin`_ may be used, and, if necessary, a Kubernetes/Kubelet-specific Snap
|
||||
plugin may be developed.
|
||||
|
||||
Relying on Snap on each node, instead of a centralized Heapster instance, will
|
||||
also result in a more scalable solution.
|
||||
|
||||
However, it is to be noted that `Kubernetes Autoscaling`_ currently requires
|
||||
Heapster. This means that Heapster must be used if the Autoscaling
|
||||
functionality is required for MCP. But in that case, no storage backend should
|
||||
be set in the Heapster configuration, as Heapster will just be used for the
|
||||
Autoscaling functionality.
|
||||
|
||||
.. _Kubernetes Monitoring: http://kubernetes.io/docs/user-guide/monitoring/
|
||||
.. _Heapster: https://github.com/kubernetes/heapster
|
||||
.. _plugins for OpenStack: https://github.com/intelsdi-x?utf8=%E2%9C%93&query=snap-plugin-collector
|
||||
.. _Docker plugin: https://github.com/intelsdi-x/snap-plugin-collector-docker
|
||||
.. _Kubernetes Autoscaling: http://kubernetes.io/docs/user-guide/horizontal-pod-autoscaling/
|
||||
|
||||
Logging
|
||||
~~~~~~~
|
||||
|
||||
`Kubernetes Logging`_ relies on Fluentd, with a Fluentd agent running on each
|
||||
node. That agent collects container logs (through the Docker Engine running on
|
||||
the node) and sends them to Google Cloud Logging or Elasticsearch (the backend
|
||||
used is pecified through the ``KUBE_LOGGING_DESTINATION`` variable).
|
||||
|
||||
The main problem with this solution is our inability to act on the logs before
|
||||
they're stored into Elasticsearch. For instance we want to be able to monitor
|
||||
tho logs, to be able to detect spikes of errors. We also want to be able to
|
||||
derive metrics from logs, such as HTTP response time metrics. Also, we may want
|
||||
to use Kafka in the future (see below). In summary, Kubernetes Logging does not
|
||||
provide us with the flexibility we need.
|
||||
|
||||
Our proposal is to use Heka instead of Fluentd. The benefits are:
|
||||
|
||||
* Flexibility (e.g. use Kafka between Heka and Elasticsearch in the future).
|
||||
* Be able to collect logs from services that can't log to stdout.
|
||||
* Team's experience on using Heka and running it in production.
|
||||
* Re-use all the Heka plugins we've developed (parsers for OpenStack logs, log
|
||||
monitoring filters, etc.).
|
||||
|
||||
.. _Kubernetes Logging: http://kubernetes.io/docs/getting-started-guides/logging/
|
||||
|
||||
Use Kafka
|
||||
---------
|
||||
|
||||
Another component that we're considering introducing is `Apache Kafka`_. Kafka
|
||||
will sit between Heka and the backends, and it will be used as a robust and
|
||||
scalable messaging system for the communications between the Heka instances and
|
||||
the backends. Heka has the capability of buffering messaging, but we believe
|
||||
that Kafka would allow for a more robust and resilient system. We may make
|
||||
Kafka optional, but highly recommended for medium and large clusters.
|
||||
|
||||
The following diagram depicts the architecture when Kafka is used:
|
||||
|
||||
OpenStack nodes
|
||||
+-------------------+
|
||||
| +-------------------+ Kafka cluster
|
||||
| | +----+ | +-------+ +-------------+
|
||||
| | Logs+-+ +-+Snap| | | | | |
|
||||
| | | | +----+ | +--+Kafka +--+ +----->Elasticsearch|
|
||||
| | +v--v+ | | | | | | | |
|
||||
| | |Heka+--------------> +-------+ +----+ +-------------+
|
||||
+-+ +----+ | | |
|
||||
+-------------------+ | +-------+ | +-------------+
|
||||
| | | | | |
|
||||
+--+Kafka +--+ +--->InfluxDB |
|
||||
| | | | | | |
|
||||
k8s master nodes | +-------+ +------+ +-------------+
|
||||
+-------------------+ | |
|
||||
| +-------------------+ +--> +-------+ +----+ +-------------+
|
||||
| | +----+ | | | | | | | | |
|
||||
| | Logs+-+ +-+Snap| | | +--+Kafka +--+ +----->Nagios |
|
||||
| | | | +----+ | | | | | |
|
||||
| | +v--v+ | | +-------+ +-------------+
|
||||
| | |Heka+-----------+
|
||||
+-+ +----+ |
|
||||
+-------------------+
|
||||
|
||||
The Heka instances running on the OpenStack and Kubernetes nodes are Kafka
|
||||
producers. Although not depicted on the diagram Heka instances will also
|
||||
probably be used as Kafka consumers between the Kafka cluster and the backends.
|
||||
We will need to run performance tests to determine if Heka will be able to keep
|
||||
up with the load when used as a Kafka consumer.
|
||||
|
||||
A specific specification will be written for the introduction of Kafka.
|
||||
|
||||
.. _Apache Kafka: https://kafka.apache.org/
|
||||
|
||||
Packaging and deployment
|
||||
------------------------
|
||||
|
||||
All the services participating to the LMA architecture will run in Docker
|
||||
containers, following the MCP approach to packaging and service execution.
|
||||
|
||||
Relying on `Kubernetes Daemon Sets`_ for deploying Heka and Snap on all the
|
||||
cluster nodes sounds like a good approach. The Kubernetes doc even mentions
|
||||
logstash and collectd as a good candidates for running as Daemon Sets.
|
||||
|
||||
.. _Kubernetes Daemon Sets: http://kubernetes.io/docs/admin/daemons/
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
The possible alternatives will be discussed in more specific specifications.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
The implementation will be described in more specific specifications.
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Primary assignee:
|
||||
elemoine (elemoine@mirantis.com)
|
||||
|
||||
Other contributors:
|
||||
obourdon (obourdon@mirantis.com)
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Other specification documents will be written:
|
||||
|
||||
* Logging with Heka
|
||||
* Logs storage and analytics with Elasticsearch and Kibana
|
||||
* Monitoring with Snap
|
||||
* Metrics storage and analytics with InfluxDB and Grafana
|
||||
* Alerting in MCP
|
||||
* Introducing Kafka to the MCP Monitoring stack
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
The testing strategy will be described in more specific specifications.
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
The MCP monitoring system will be documented.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
None.
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
None.
|
Loading…
Reference in New Issue