From ccf7aaff6f79ebabacb015186ecf9f75c0517111 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=89ric=20Lemoine?= <elemoine@mirantis.com>
Date: Fri, 1 Apr 2016 15:23:13 +0200
Subject: [PATCH] Add logging with Heka spec

Change-Id: I278888696f54a813cdf4b8535ddd4b8af7fe318d
---
 specs/logging-with-heka.rst | 229 ++++++++++++++++++++++++++++++++++++
 1 file changed, 229 insertions(+)
 create mode 100644 specs/logging-with-heka.rst

diff --git a/specs/logging-with-heka.rst b/specs/logging-with-heka.rst
new file mode 100644
index 0000000..cf96af6
--- /dev/null
+++ b/specs/logging-with-heka.rst
@@ -0,0 +1,229 @@
+=================
+Logging with Heka
+=================
+
+[No Jira Epic for this spec]
+
+This specification describes the logging system to implement in Mirantis Cloud
+Platform (MCP). It is complementary to the *General Logging, Monitoring,
+Alerting architecture for MCP* specification.
+
+This system is based on `Heka`_, `Elasticsearch`_ and `Kibana`_.
+
+.. _Heka: http://hekad.readthedocs.org/
+.. _Elasticsearch: https://www.elastic.co/products/elasticsearch
+.. _Kibana: https://www.elastic.co/products/kibana
+
+Problem description
+===================
+
+It is important that MCP comes with a robust and scalable logging system. This
+specification describes the logging system we want to implement in MCP. It is
+based on StackLight's logging solution for MOS.
+
+Use Cases
+---------
+
+The target user of the logging system is the Operator of the Kubernetes
+cluster. The Operator will use Kibana to view and search logs, with dashboards
+providing statistical views of the logs.
+
+We also want to be able to derive metrics from logs and to monitor logs, for
+example to detect spikes of errors. The solution described in this
+specification makes this possible, but the details of log monitoring will be
+covered in a separate specification.
+
+The deployment and configuration of Elasticsearch and Kibana through Kubernetes
+will also be described in a separate specification.
+
+Proposed change
+===============
+
+Architecture
+------------
+
+This is the architecture::
+
+      Cluster nodes
+    +---------------+                        +----------------+
+    | +---------------+                      | +----------------+
+    | | +---------------+                    | | +----------------+
+    | | |               |                    | | |                |
+    | | | Logs+-+       |                    | | |                |
+    | | |       |       |                    | | |                |
+    | | |       |       |                    | | | Elasticsearch  |
+    | | |    +--v-+     |                    | | |                |
+    | | |    |Heka+--------------------------> | |                |
+    +-+ |    +----+     |                    +-+ |                |
+      +-+               |                      +-+                |
+        +---------------+                        +----------------+
+
+In this architecture Heka runs on every node of the Kubernetes cluster. It runs
+in a dedicated container, referred to as the *Heka container* in the rest of
+this document.
+
+Each Heka instance reads and processes the logs local to the node it runs on,
+and sends these logs to Elasticsearch for indexing. Elasticsearch may be
+distributed on multiple nodes for resiliency and scalability, but this topic is
+outside the scope of this specification.
+
+Heka, written in Go, is fast and has a small footprint, making it possible to
+run it on every node of the cluster and effectively distribute the log
+processing load.
+
+Another important aspect is flow control and avoiding the loss of log messages
+in case of overload. Heka's filter and output plugins, and the Elasticsearch
+output plugin in particular, support the use of a disk-based message queue.
+This message queue allows plugins to reprocess messages from the queue when
+downstream servers (Elasticsearch) are down or cannot keep up with the data
+flow.
+
+Rely on Docker Logging
+----------------------
+
+Based on `discussions`_ with the Mirantis architects and experience gained with
+the Kolla project, the plan is to rely on `Docker Logging`_ and Heka's
+`DockerLogInput plugin`_.
+
+Since the `Kolla logging specification`_ was written, the support for Docker
+Logging has improved in Heka. More specifically, Heka is now able to collect
+logs that were created while Heka wasn't running.
+
+Things to note:
+
+* When ``DockerLogInput`` is used there is no way to differentiate log messages
+  for containers producing multiple log streams – containers running multiple
+  processes/agents for example. So some other technique will have to be used
+  for containers producing multiple log streams. One technique involves using
+  log files and Docker volumes, which is the technique currently used in Kolla.
+  Another technique involves having services use Syslog and have Heka act as
+  a Syslog server for these services.
+
+* We will also probably encounter services that cannot be configured to log to
+  ``stdout``. So again, we will have to resort to using some other technique
+  for these services. Log files or Syslog can be used, as described previously.
+
+* Past experiments have shown that the OpenStack logs written to ``stdout`` are
+  visible to neither Heka nor ``docker logs``.  This problem does not exist
+  when ``stderr`` is used rather than ``stdout``.  The cause of this problem is
+  currently unknown.
+
+* ``DockerLogInput`` relies on Docker's `Get container logs endpoint`_, which
+  works only for containers with the ``json-file`` or ``journald`` logging
+  drivers. This means the Docker daemon must be configured with either the
+  ``json-file`` or the ``journald`` logging driver.
+
+* If the ``json-file`` logging driver is used then the ``max-size`` and
+  ``max-file`` options should be set, so that container logs are rotated as
+  appropriate. These options are not set by default in Ubuntu (in neither 14.04
+  nor 16.04).
+
+.. _discussions: https://docs.google.com/document/d/15QYIX_cggbDH2wAJ6-7xUfmyZ3Izy_MOasVACutwqkE
+.. _Docker Logging: https://docs.docker.com/engine/admin/logging/overview/
+.. _DockerLogInput plugin: http://hekad.readthedocs.org/en/v0.10.0/config/inputs/docker_log.html
+.. _Kolla logging specification: https://github.com/openstack/kolla/blob/master/specs/logging-with-heka.rst
+.. _Get container logs endpoint: https://docs.docker.com/engine/reference/api/docker_remote_api_v1.20/#get-container-logs
+
+Read Python Tracebacks
+----------------------
+
+In case of exceptions the OpenStack services log Python tracebacks as multiple
+log messages. If no special care is taken, each line of a traceback will be
+indexed as a separate document in Elasticsearch and displayed as a distinct log
+entry in Kibana, making tracebacks hard to read. To address that issue we will
+use a custom Heka decoder, which will be responsible for coalescing the log
+lines making up a Python traceback into one message.
+
+Collect system logs
+-------------------
+
+In addition to container logs it is important to collect system logs as well.
+For that we propose to mount the host's ``/var/log`` directory into the Heka
+container (as ``/var/log-host/``), and configure Heka to get logs from standard
+log files located in that directory (e.g. ``kern.log``, ``auth.log``,
+``messages``).
+
+Create a ``heka`` user
+----------------------
+
+For security reasons a ``heka`` user will be created in the Heka container and
+the ``hekad`` daemon will run under that user.
+
+Deployment
+----------
+
+Following the MCP approach to packaging and service execution the Heka daemon
+will run in a container. We plan to rely on Kubernetes's `Daemon Sets`_
+functionality for deploying Heka on all the Kubernetes nodes.
+
+We also want Heka to be deployed on the Kubernetes master node. For that the
+Kubernetes master node should also be a minion server, where Kubernetes may
+deploy containers.
+
+.. _Daemon Sets: http://kubernetes.io/docs/admin/daemons/
+
+Security impact
+---------------
+
+The security impact is minor, as Heka will not expose any network port to the
+outside. Also, Heka's "dynamic sandboxes" functionality will be disabled,
+eliminating the risk of injecting malicious code into the Heka pipeline.
+
+Performance Impact
+------------------
+
+The ``hekad`` daemon will run in a container on each cluster node, and we have
+assessed that Heka is lightweight enough to run on every node. See the
+`Introduction of Heka in Kolla`_ email sent to the openstack-dev mailing list
+for a comparison of Heka and Logstash. Also, a possible option would be to
+constrain the resources allocated to the Heka container.
+
+.. _Introduction of Heka in Kolla: http://lists.openstack.org/pipermail/openstack-dev/2016-January/083751.html
+
+Alternatives
+------------
+
+An alternative to this proposal involves relying on Kubernetes Logging, i.e.
+using Kubernetes's native logging system. Some `research`_ has been done on
+Kubernetes Logging. The conclusion of this research is that Kubernetes Logging
+is not flexible enough, making it impossible to implement features such as
+log monitoring in the future.
+
+.. _research: https://mirantis.jira.com/wiki/display/NG/k8s+LMA+approaches
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Éric Lemoine (elemoine)
+
+Work Items
+----------
+
+1. Create a Heka Docker image
+2. Create some general Heka configuration
+3. Deploy Heka through Kubernetes
+4. Collect OpenStack logs
+5. Collect other services' logs (RabbitMQ, MySQL...)
+6. Collect Kubernetes logs
+7. Send logs to Elasticsearch
+
+Testing
+=======
+
+We will add functional tests that verify that the Heka chain works for all the
+service and system logs Heka collects. These tests will be executed as part of
+the gating process.
+
+Documentation Impact
+====================
+
+None.
+
+References
+==========
+
+None.