rsyslog remote logging

BP https://blueprints.launchpad.net/tripleo/+spec/remote-logging Parent BP https://blueprints.launchpad.net/tripleo/+spec/logging-stdout-rsyslog This spec handles remote logging using rsyslog and tries to define useful conventions for storing and processing logs with TripleO. Change-Id: I5402a86388a425df181d58ff0850a3e2d4444c9d
2017-11-28 12:59:24 -05:00
commit 7d9010c3e4
parent d0537d9f89
1 changed files with 276 additions and 0 deletions
--- a/specs/rocky/tripleo-rsyslog-remote-logging.rst
+++ b/specs/rocky/tripleo-rsyslog-remote-logging.rst
@ -0,0 +1,276 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+        TripleO Remote Logging
+==========================================
+
+https://blueprints.launchpad.net/tripleo/+spec/remote-logging
+
+This spec extends the tripleo-logging spec, also for Queens, to address key
+issues of log transport and storage that are separate from the technical
+requirements created by logging for containerized processes.
+
+Problem Description
+===================
+
+Having logs stuck on individual overcloud nodes isn't a workable solution
+for a modern system deployed at scale, but log aggregation is complex both
+to implement and to scale. TripleO should provide a robust, well-documented,
+and scalable solution that serves the majority of users' needs and is
+easily extensible for others.
+
+
+Proposed Change
+===============
+
+Overview
+--------
+
+In addition to the rsyslog logging to stdout defined for containers in the
+tripleo-logging spec, this spec outlines in detail how logging to remote
+targets should work.
+
+Essentially this comes down to a set of options for the config
+of the rsyslog container. Other services will have a fixed rsyslog config
+that forwards messages over journald for the rsyslog container to pick up.
+
+1. Logging destination: local, remote direct, or remote aggregator
+
+Remote direct means sending logs directly to a storage solution, in this case
+Elasticsearch or plaintext on disk. Remote aggregator is a design where
+the processing, formatting, and insertion of the logs is left to the
+aggregator server. Using aggregators, it's possible to scale log collection to
+hundreds of overcloud nodes without overwhelming the storage backend with
+inefficient connections.
+
+2. Log caching for remote targets
+
+In the case of remote targets a caching system can be set up, where logs are
+stored temporarily on the local machine in a configurable disk or memory cache
+until they can be uploaded to an aggregator or storage system. While some
+in-memory cache is mandatory, users may also select a disk cache depending on
+how important it is that all logs be saved and stored. This allows recovery
+without loss of messages during network or service outages.
+
+
+3. Log security in transit
+
+In some cases encryption during transit may be required. rsyslog offers
+SSL/TLS-based encryption that should be easily deployable.
+
+4. Standard and extensible format
+
+By default logs should be formatted as outlined by the Red Hat common logging
+initiative. Standardizing the logging format where possible makes tools
+and analytics more portable.
+
+Mandatory fields for this standard formatting include:
+
+* version: the version of the logging template
+* level: loglevel
+* message: the log message
+* tags: user specific tagging info
+
+Additional fields must be added in the format:
+
+<subject>.<subfield name>
+
+Below is an example record produced by rsyslog for storage in Elasticsearch.
+
+@timestamp 		November 27th 2017, 08:54:40.091
+@version 		2016.01.06-0
+_id 		AV_9wiWQzdGOuK5_zY5J
+_index 		logstash-2017.11.27.08
+_score
+_type 		rsyslog
+browbeat.cloud_name 		openstack-12-noncontainers-beta
+hostname 		lorenzo.perf.lab.eng.rdu.redhat.com
+level 		info
+message 		Stopping LVM2 PV scan on device 8:2...
+pid 		1
+rsyslog.appname 		systemd
+rsyslog.facility 		daemon
+rsyslog.fromhost-ip 		10.12.20.155
+rsyslog.inputname 		imptcp
+rsyslog.protocol-version 		1
+syslog.timegenerated 		November 27th 2017, 08:54:40.092
+systemd.t.BOOT_ID 		1e99848dbba047edaf04b150313f67a8
+systemd.t.CAP_EFFECTIVE 		1fffffffff
+systemd.t.CMDLINE 		/usr/lib/systemd/systemd --switched-root --system --deserialize 21
+systemd.t.COMM 		systemd
+systemd.t.EXE 		/usr/lib/systemd/systemd
+systemd.t.GID 		0
+systemd.t.MACHINE_ID 		0d7fed5b203f4664b0b4be90e4a8a992
+systemd.t.SELINUX_CONTEXT 		system_u:system_r:init_t:s0
+systemd.t.SOURCE_REALTIME_TIMESTAMP 		1511790880089672
+systemd.t.SYSTEMD_CGROUP 		/
+systemd.t.TRANSPORT 		journal
+systemd.t.UID 		0
+systemd.u.CODE_FILE 		src/core/unit.c
+systemd.u.CODE_FUNCTION 		unit_status_log_starting_stopping_reloading
+systemd.u.CODE_LINE 		1417
+systemd.u.MESSAGE_ID 		de5b426a63be47a7b6ac3eaac82e2f6f
+systemd.u.UNIT 		lvm2-pvscan@8:2.service
+tags
+
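The field conventions above can be illustrated with a minimal Python sketch. It is not part of the proposed implementation: `build_record` and its arguments are hypothetical names, the values are copied from the example record, and the sketch uses the field name `version` from the mandatory list even though the stored example shows `@version`.

```python
import json

# Mandatory fields named in this spec.
MANDATORY_FIELDS = ("version", "level", "message", "tags")


def build_record(version, level, message, tags=None, **subjects):
    """Build a flat record: mandatory fields plus <subject>.<subfield name> keys.

    Each keyword argument is a subject (for example ``rsyslog``) mapped to a
    dict of subfields.
    """
    record = {
        "version": version,
        "level": level,
        "message": message,
        "tags": tags if tags is not None else "",
    }
    for subject, subfields in subjects.items():
        for name, value in subfields.items():
            record[f"{subject}.{name}"] = value
    return record


record = build_record(
    version="2016.01.06-0",
    level="info",
    message="Stopping LVM2 PV scan on device 8:2...",
    rsyslog={"appname": "systemd", "facility": "daemon"},
    systemd={"t.UID": "0"},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Keeping the record flat with dotted keys mirrors how the example above appears once stored.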
+As a visual aid, here's a quick diagram of the flow of data.
+
+<rsyslog in process container> -> <journald> -> <rsyslog container> -> <rsyslog aggregator / Elasticsearch>
+
+In the process container, logs from the application are packaged with metadata
+from systemd and other components, depending on how rsyslog is configured.
+journald acts as a transport, aggregating this input across all containers for
+the rsyslog container, which formats the data into storable JSON and handles
+things like transforming fields and adding additional metadata as desired.
+Finally, the data is inserted into Elasticsearch directly or held by an
+aggregator for a few seconds before being bulk inserted into Elasticsearch.
+
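The "hold for a few seconds, then bulk insert" step can be sketched in Python as follows. This only illustrates the batching idea and is not how rsyslog implements it: `Batcher`, `bulk_body`, and the thresholds are hypothetical. The body follows the newline-delimited action/document layout of the Elasticsearch bulk API, and `_type` is omitted although older Elasticsearch versions may require it.

```python
import json


def bulk_body(records, index):
    """Render records as an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(record))                        # document line
    return "\n".join(lines) + "\n"


class Batcher:
    """Hold records briefly, releasing them in batches by size or age."""

    def __init__(self, max_records, max_age_seconds):
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.pending = []
        self.oldest = None  # arrival time of the first pending record

    def add(self, record, now):
        """Buffer a record; return a batch to flush, or None to keep holding."""
        if not self.pending:
            self.oldest = now
        self.pending.append(record)
        full = len(self.pending) >= self.max_records
        stale = now - self.oldest >= self.max_age_seconds
        if full or stale:
            batch, self.pending = self.pending, []
            return batch
        return None


batcher = Batcher(max_records=3, max_age_seconds=5.0)
flushed = [batcher.add({"message": f"m{i}"}, now=float(i)) for i in range(3)]
body = bulk_body(flushed[-1], index="logstash-2017.11.27.08")
print(body)
```

Batching this way trades a few seconds of latency for far fewer requests against the storage backend, which is the motivation given for aggregators above.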
+
+Alternatives
+------------
+
+TripleO already has some level of Fluentd integration, but performance issues
+make it unusable at scale. Furthermore, it's not well prepared for container
+logging.
+
+Ideally Fluentd as a logging backend would be maintained, improved, and modified
+to use the common logging format for easy swapping of solutions.
+
+Security Impact
+---------------
+
+The security of remotely stored data and the log storage database is outside
+of the scope of this spec. The major remaining concerns are security in
+transit and the changes required to systemd for rsyslog to send data
+remotely.
+
+A new systemd policy will have to be put into place to ensure that systemd
+can successfully log to remote targets. By default the syslog rules prevent
+any outside world access or port access, both of which are required for
+log forwarding.
+
+For log encryption in transit an SSL certificate will have to be generated and
+distributed to all nodes in the cloud securely, probably during deployment.
+Special care should be taken to ensure that any misconfigured instance of
+rsyslog without a certificate where one is required does not transmit logs
+by accident.
+
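The "do not transmit by accident" requirement amounts to failing closed. Below is a minimal Python sketch of that check. All names (`ForwardingConfig`, `choose_transport`, `InsecureTransportError`) and the host and path values are hypothetical and do not correspond to any rsyslog or TripleO option.

```python
from dataclasses import dataclass
from typing import Optional


class InsecureTransportError(Exception):
    """Raised when encryption is required but no certificate is configured."""


@dataclass
class ForwardingConfig:
    target: str                         # aggregator or storage host (assumed)
    require_encryption: bool            # deployment-wide policy flag (assumed)
    ca_cert_path: Optional[str] = None  # certificate distributed at deploy time


def choose_transport(config):
    """Return "tls" or "plain", refusing to forward when TLS is required
    but no certificate is available."""
    if config.ca_cert_path:
        return "tls"
    if config.require_encryption:
        raise InsecureTransportError(
            f"refusing to forward logs to {config.target}: "
            "encryption is required but no certificate is configured"
        )
    return "plain"


print(choose_transport(ForwardingConfig("aggregator.example.com", False)))
```

The point of the design is that a missing certificate produces an error at configuration time instead of a silent fallback to plaintext.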
+
+Other End User Impact
+---------------------
+
+Ideally users will read some documentation and pass an extra 5-6 variables to
+TripleO to deploy with logging aggregation. It's very important that logging
+be easy to set up with sane defaults and no requirement on the user to implement
+their own formatting or template.
+
+Users may also have to set up a database for log storage and an aggregator if
+their deployment is large enough that they need one. Playbooks to do this
+automatically will be provided, but probably don't belong in TripleO.
+
+Special care will have to be taken to size storage and aggregation hardware
+to the task. While rsyslog is very efficient, storage quickly becomes a problem
+when a cloud can generate 100 GB of logs a day, especially since log storage
+systems leave it up to the user to put rotation rules in place.
+
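As a rough sizing aid, the arithmetic can be sketched in Python. The 100 GB per day figure comes from the text above. The 30-day retention, replica count, and overhead factor are illustrative assumptions, not recommendations from this spec.

```python
def required_storage_gb(daily_gb, retention_days, replicas=0, overhead=0.0):
    """Estimate disk needed to retain logs until rotation removes them.

    replicas is the number of extra copies kept by the storage system and
    overhead is a fractional allowance for indexing (both assumed inputs).
    """
    copies = 1 + replicas
    return daily_gb * retention_days * copies * (1.0 + overhead)


print(required_storage_gb(100, 30))              # 3000.0
print(required_storage_gb(100, 30, replicas=1))  # 6000.0
```

Even with no replication, a month of retention at this rate is about 3 TB, which is why rotation rules need to be decided up front.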
+
+Performance Impact
+------------------
+
+For small clouds rsyslog direct to Elasticsearch will perform just fine.
+As scale increases an aggregator (also running rsyslog, except configured
+to accept and format input) is required. I have yet to test a large enough
+cloud that an aggregator was at all stressed. Hundreds of gigabytes of logs a
+day are possible with a single 32 GB RAM VM as an Elasticsearch instance.
+
+For the overcloud nodes forwarding their logs the impact is variable, depending
+on the user's configuration. CPU requirements don't exceed a single-digit
+percentage of a single core even under heavy load, but storage requirements can
+balloon if a large on-disk cache was specified and connectivity with the
+aggregator or database is lost for prolonged periods.
+
+Memory usage is no more than a few hundred MB, and most of that is the default
+in-memory log cache, which once again could be expanded by the user.
+
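The interaction between the in-memory cache and the optional on-disk cache can be sketched in Python. This is a simplified model and not rsyslog's queue implementation: `LogCache`, its limits, and the spool file format are hypothetical, and a real cache would also need to handle a `send` failure partway through a drain.

```python
import collections
import json
import os
import tempfile


class LogCache:
    """Bounded in-memory queue that spills to a disk spool when full."""

    def __init__(self, memory_limit, spool_path=None):
        self.memory = collections.deque()
        self.memory_limit = memory_limit
        self.spool_path = spool_path  # None means no disk cache is configured
        self.dropped = 0

    def put(self, record):
        """Cache a record while the remote target is unreachable."""
        if len(self.memory) < self.memory_limit:
            self.memory.append(record)
        elif self.spool_path is not None:
            with open(self.spool_path, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(record) + "\n")
        else:
            self.dropped += 1  # memory-only cache is full: the record is lost

    def drain(self, send):
        """Deliver cached records via send(): memory first, then the spool."""
        delivered = 0
        while self.memory:
            send(self.memory.popleft())
            delivered += 1
        if self.spool_path is not None and os.path.exists(self.spool_path):
            with open(self.spool_path, encoding="utf-8") as spool:
                for line in spool:
                    send(json.loads(line))
                    delivered += 1
            os.remove(self.spool_path)
        return delivered


with tempfile.TemporaryDirectory() as tmp:
    cache = LogCache(memory_limit=2, spool_path=os.path.join(tmp, "spool.jsonl"))
    for i in range(5):
        cache.put({"message": f"event {i}"})
    sent = []
    cache.drain(sent.append)
```

With a disk spool configured nothing is dropped during an outage, but the spool grows for as long as the outage lasts, which is the storage risk described above.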
+
+Other Deployer Impact
+---------------------
+
+N/A
+
+Developer Impact
+----------------
+
+N/A
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  jkilpatr
+
+Other contributors:
+  jaosorior
+
+Work Items
+----------
+
+rsyslog container - jaosorior
+
+rsyslog templating and deployment role - jkilpatr
+
+aggregator and storage server deployment tooling - jkilpatr
+
+
+Dependencies
+============
+
+Blueprint dependencies:
+
+https://blueprints.launchpad.net/tripleo/+spec/logging-stdout-rsyslog
+
+Package dependencies:
+
+rsyslog, rsyslog-elasticsearch, rsyslog-mmjsonparse
+
+Specifically, version 8 of rsyslog is required, as it is the earliest
+version supported by rsyslog-elasticsearch. These packages are available in
+CentOS and RHEL 7.4 extras.
+
+Testing
+=======
+
+Logging aggregation can be tested in CI by deploying it during any existing CI job.
+
+For extra validation, a script can check the output stored in Elasticsearch.
+
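One possible shape for such a validation script is sketched below in Python. It assumes the search results have already been fetched and follow the usual Elasticsearch layout, where each hit carries `_id` and `_source`. `validate_hits` and the sample hits are hypothetical, and the field names follow the mandatory list in this spec.

```python
MANDATORY_FIELDS = ("version", "level", "message", "tags")


def missing_fields(document):
    """Return the mandatory fields absent from one stored log document."""
    return [field for field in MANDATORY_FIELDS if field not in document]


def validate_hits(hits):
    """Map hit _id to its missing mandatory fields, for failing hits only."""
    failures = {}
    for hit in hits:
        missing = missing_fields(hit.get("_source", {}))
        if missing:
            failures[hit.get("_id")] = missing
    return failures


# Hypothetical sample of hits, as a CI job might receive them.
sample_hits = [
    {"_id": "good", "_source": {"version": "1", "level": "info",
                                "message": "ok", "tags": ""}},
    {"_id": "bad", "_source": {"message": "no level or version"}},
]
print(validate_hits(sample_hits))
```

A CI job could fail whenever the returned mapping is non-empty.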
+
+Documentation Impact
+====================
+
+Documentation will need to be written about the various modes and tunables for
+logging and how to deploy them, as well as sizing recommendations for the log
+storage system and aggregators where required.
+
+
+References
+==========
+
+https://review.openstack.org/#/c/490047/
+
+https://review.openstack.org/#/c/521083/
+
+https://blueprints.launchpad.net/tripleo/+spec/logging-stdout-rsyslog