rsyslog remote logging
BP https://blueprints.launchpad.net/tripleo/+spec/remote-logging Parent BP https://blueprints.launchpad.net/tripleo/+spec/logging-stdout-rsyslog This spec handles remote logging using rsyslog and tries to define useful conventions for storing and processing logs iwth TripleO. Change-Id: I5402a86388a425df181d58ff0850a3e2d4444c9d
This commit is contained in:
parent
d0537d9f89
commit
7d9010c3e4
276
specs/rocky/tripleo-rsyslog-remote-logging.rst
Normal file
276
specs/rocky/tripleo-rsyslog-remote-logging.rst
Normal file
@ -0,0 +1,276 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
TripleO Remote Logging
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/remote-logging
|
||||
|
||||
This spec is meant to extend the tripleo-logging spec also for queens to
|
||||
address key issues about log transport and storage that are separate from
|
||||
the technical requirements created by logging for containerized processes.
|
||||
|
||||
Problem Description
|
||||
===================
|
||||
|
||||
Having logs stuck on individual overcloud nodes isn't a workable solution
|
||||
for a modern system deployed at scale. But log aggregation is complex both
|
||||
to implement and to scale. TripleO should provide a robust, well documented,
|
||||
and scalable solution that will serve the majority of users needs and be
|
||||
easily extensible for others.
|
||||
|
||||
|
||||
Proposed Change
|
||||
===============
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
In addition to the rsyslog logging to stdout defined for containers in the
|
||||
triple-logging spec this spec outlines how logging to remote targets should
|
||||
work in detail.
|
||||
|
||||
Essentially this comes down to a set of options for the config
|
||||
of the rsyslog container. Other services will have a fixed rsyslog config
|
||||
that forwards messages to the rsyslog container to pick up over journald.
|
||||
|
||||
1. Logging destination, local, remote direct, or remote aggregator.
|
||||
|
||||
Remote direct means to go direct to a storage solution, in this case
|
||||
Elasticsearch or plaintext on the disk. Remote aggregator is a design where
|
||||
the processing, formatting, and insertion of the logs is a task left to the
|
||||
aggregator server. Using aggregators it's possible to scale log collection to
|
||||
hundreds of overcloud nodes without overwhelming the storage backend with
|
||||
inefficient connections.
|
||||
|
||||
2. Log caching for remote targets
|
||||
|
||||
In the case of remote targets a caching system can be setup, where logs are
|
||||
stored temporarily on the local machine in a configurable disk or memory cache
|
||||
until they can be uploaded to an aggregator or storage system. While some in
|
||||
memory cache is mandatory users may select a disk cache depending on how
|
||||
important it is that all logs be saved and stored. This allows recovery
|
||||
without loss of messages during network outages or service outages.
|
||||
|
||||
|
||||
3. Log security in transit
|
||||
|
||||
In some cases encryption during transit may be required. rsyslog offers
|
||||
ssl based encryption that should be easily deployable.
|
||||
|
||||
4. Standard and extensible format
|
||||
|
||||
By default logs should be formatted as outlined by the Redhat common logging
|
||||
initiative. By standardizing logging format where possible various tools
|
||||
and analytics become more portable.
|
||||
|
||||
Mandatory fields for this standard formatting include.
|
||||
|
||||
version: the version of the logging template
|
||||
level: loglevel
|
||||
message: the log message
|
||||
tags: user specific tagging info
|
||||
|
||||
Additional fields must be added in the format of
|
||||
|
||||
<subject>.<subfield name>
|
||||
|
||||
See an example by rsyslog for storage in Elasticsearch below.
|
||||
|
||||
@timestamp November 27th 2017, 08:54:40.091
|
||||
@version 2016.01.06-0
|
||||
_id AV_9wiWQzdGOuK5_zY5J
|
||||
_index logstash-2017.11.27.08
|
||||
_score
|
||||
_type rsyslog
|
||||
browbeat.cloud_name openstack-12-noncontainers-beta
|
||||
hostname lorenzo.perf.lab.eng.rdu.redhat.com
|
||||
level info
|
||||
message Stopping LVM2 PV scan on device 8:2...
|
||||
pid 1
|
||||
rsyslog.appname systemd
|
||||
rsyslog.facility daemon
|
||||
rsyslog.fromhost-ip 10.12.20.155
|
||||
rsyslog.inputname imptcp
|
||||
rsyslog.protocol-version 1
|
||||
syslog.timegenerated November 27th 2017, 08:54:40.092
|
||||
systemd.t.BOOT_ID 1e99848dbba047edaf04b150313f67a8
|
||||
systemd.t.CAP_EFFECTIVE 1fffffffff
|
||||
systemd.t.CMDLINE /usr/lib/systemd/systemd --switched-root --system --deserialize 21
|
||||
systemd.t.COMM systemd
|
||||
systemd.t.EXE /usr/lib/systemd/systemd
|
||||
systemd.t.GID 0
|
||||
systemd.t.MACHINE_ID 0d7fed5b203f4664b0b4be90e4a8a992
|
||||
systemd.t.SELINUX_CONTEXT system_u:system_r:init_t:s0
|
||||
systemd.t.SOURCE_REALTIME_TIMESTAMP 1511790880089672
|
||||
systemd.t.SYSTEMD_CGROUP /
|
||||
systemd.t.TRANSPORT journal
|
||||
systemd.t.UID 0
|
||||
systemd.u.CODE_FILE src/core/unit.c
|
||||
systemd.u.CODE_FUNCTION unit_status_log_starting_stopping_reloading
|
||||
systemd.u.CODE_LINE 1417
|
||||
systemd.u.MESSAGE_ID de5b426a63be47a7b6ac3eaac82e2f6f
|
||||
systemd.u.UNIT lvm2-pvscan@8:2.service
|
||||
tags
|
||||
|
||||
As a visual aid here's a quick diagram of the flow of data.
|
||||
|
||||
<rsyslog in process container> -> <journald> -> <rsyslog container> -> <rsyslog aggregator / Elasticsearch>
|
||||
|
||||
In the process container logs from the application are packaged with metadata
|
||||
from systemd and other components depending on how rsyslog is configured,
|
||||
journald acts as a transport aggregating this input across all containers for
|
||||
the rsyslog container which formats this data into storable json and handles
|
||||
things like transforming fields and adding additional metadta as desired.
|
||||
Finally the data is inserted into elasticsearch or further held by an
|
||||
aggrebator for a few seconds before being bulk inserted into Elasticsearch.
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
TripleO already has some level of FluentD integration, but performance issues
|
||||
make it unusable at scale. Furthermore it's not well prepared for container
|
||||
logging.
|
||||
|
||||
Ideally FluentD as a logging backend would be maintained, improved, and modified
|
||||
to use the common logging format for easy swapping of solutions.
|
||||
|
||||
Security Impact
|
||||
---------------
|
||||
|
||||
The security of remotely stored data and the log storage database is outside
|
||||
of the scope of this spec. The major remaining concerns are security in
|
||||
in transit and the changes required to systemd for rsyslog to send data
|
||||
remotely.
|
||||
|
||||
A new systemd policy will have to be put into place to ensure that systemd
|
||||
can successfully log to remote targets. By default the syslog rules prevent
|
||||
any outside world access or port access, both of which are required for
|
||||
log forwarding.
|
||||
|
||||
For log encryption in transit a ssl certificate will have to be generated and
|
||||
distributed to all nodes in the cloud securely, probably during deployment.
|
||||
Special care should be taken to ensure that any misconfigured instance of
|
||||
rsyslog without a certificate where one is required do not transmit logs
|
||||
by accident.
|
||||
|
||||
|
||||
Other End User Impact
|
||||
---------------------
|
||||
|
||||
Ideally users will read some documentation and pass an extra 5-6 variables to
|
||||
TripleO to deploy with logging aggregation. It's very important that logging
|
||||
be easy to setup with sane defaults and no requirement on the user to implement
|
||||
their own formatting or template.
|
||||
|
||||
Users may also have to setup a database for log storage and an aggregator if
|
||||
their deployment is large enough that they need one. Playbooks to do this
|
||||
automatically will be provided, but probably don't belong in TripleO.
|
||||
|
||||
Special care will have to be taken to size storage and aggregation hardware
|
||||
to the task, while rsyslog is very efficient storage quickly becomes a problem
|
||||
when a cloud can generate 100gb of logs a day. Especially since log storage
|
||||
systems leave it up to the user to put in place rotation rules.
|
||||
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
For small clouds rsyslog direct to Elasticsearch will perform just fine.
|
||||
As scale increases an aggregator (also running rsyslog, except configured
|
||||
to accept and format input) is required. I have yet to test a large enough
|
||||
cloud that an aggregator was at all stressed. Hundreds of gigs of logs a day
|
||||
are possible with a single 32gb ram VM as an Elastic instance.
|
||||
|
||||
For the Overcloud nodes forwarding their logs the impact is variable depending
|
||||
on the users configuration. CPU requirements don't exceed single digits of a
|
||||
single core even under heavy load but storage requirements can balloon if a
|
||||
large on disk cache was specified and connectivity with the aggregator or
|
||||
database is lost for prolonged periods.
|
||||
|
||||
Memory usage is no more than a few hundred mb and most of that is the default
|
||||
in memory log cache. Which once again could be expanded by the user.
|
||||
|
||||
|
||||
Other Deployer Impact
|
||||
---------------------
|
||||
|
||||
N/A
|
||||
|
||||
Developer Impact
|
||||
----------------
|
||||
|
||||
N/A
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Who is leading the writing of the code? Or is this a blueprint where you're
|
||||
throwing it out there to see who picks it up?
|
||||
|
||||
If more than one person is working on the implementation, please designate the
|
||||
primary author and contact.
|
||||
|
||||
Primary assignee:
|
||||
jkilpatr
|
||||
|
||||
Other contributors:
|
||||
jaosorior
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
rsyslog container - jaosorior
|
||||
|
||||
rsyslog templating and deployment role - jkilpatr
|
||||
|
||||
aggregator and storage server deployment tooling - jkilpatr
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Blueprint dependencies:
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/logging-stdout-rsyslog
|
||||
|
||||
Package dependencies:
|
||||
|
||||
rsyslog, rsyslog-elasticsearch, rsyslog-mmjsonparse
|
||||
|
||||
specifically version 8 of rsyslog, which is the earliest
|
||||
supported by rsyslog-elasticsearch, these are packaged in
|
||||
Centos and rhel 7.4 extras.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Logging aggregation can be tested in CI by deploying it during any existing CI job.
|
||||
|
||||
For extra validation have a script to check the output into Elasticsearch.
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Documentation will need to be written about the various modes and tunables for
|
||||
logging and how to deploy them. As well as sizing recommendations for the log
|
||||
storage system and aggregators where required.
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
https://review.openstack.org/#/c/490047/
|
||||
|
||||
https://review.openstack.org/#/c/521083/
|
||||
|
||||
https://blueprints.launchpad.net/tripleo/+spec/logging-stdout-rsyslog
|
Loading…
x
Reference in New Issue
Block a user