Merge "Improve ansible logging within tripleoclient"
commit
df4316d0ba
304
specs/xena/ansible-logging-tripleoclient.rst
Normal file
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==================================================
Improve logging for ansible calls in tripleoclient
==================================================

Launchpad blueprint:

https://blueprints.launchpad.net/tripleo/+spec/ansible-logging-tripleoclient

Problem description
===================
Currently, the ansible playbook logging shown during a deploy or day-2
operations such as upgrade, update or scaling is either too verbose or not
verbose enough.

Furthermore, since we're moving to ephemeral services on the Undercloud (see
`ephemeral heat`_ for instance), getting information about the state, content
and related things is a bit less intuitive. Proper logging, with an associated
CLI, can really improve that situation and provide a better user experience.

Requirements for the solution
=============================
No new service addition
-----------------------
We are already trying to remove services from the Undercloud, such as Mistral,
so this is not the time to add new ones.

No increase in deployment and day-2 operations time
---------------------------------------------------
The solution must not increase the time taken for deploy, update, upgrade,
scaling and any other day-2 operations. It must be 100% transparent to the
operator.

Use existing tools
------------------
In the same way we don't want new services, we don't want to reinvent the
wheel once more, so we must first check the already huge catalog of existing
solutions.

KISS
----
Keep It Simple, Stupid is a key element - code must be easy to understand and
maintain.

Proposed Change
===============

Introduction
------------
While working on the `Validation Framework`_, a big part of the effort went
into logging. There, we found a way to get an actual computable output and
store it in a defined location, which allows us to provide a nice interface to
list and show validation runs.

This heavily relies on an ansible callback plugin with specific libs, which are
shipped in the `python-validations-libs`_ package.

Since the approach is modular, those libs can be re-used pretty easily in other
projects.

In addition, python-tripleoclient already depends on `python-validations-libs`_
(via a dependency on validations-common), meaning we already have the needed
bits.

The Idea
--------
Since the mandatory code is already present on the system (provided by the
new `python-validations-libs`_ package), we can modify how ansible-runner is
configured in order to inject a callback, and get the output we need both in
the shell (direct feedback to the operator) and in a dedicated file.

Since callbacks aren't free (but, hopefully, not expensive either), a proper
PoC must be conducted in order to gather metrics about CPU, RAM and time.
Please see the Performance Impact section.

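As a minimal sketch of the configuration change described above, the snippet
below builds the environment that could be handed to ansible-runner.
``ANSIBLE_STDOUT_CALLBACK`` is the standard Ansible setting selecting the
stdout callback; the callback name ``tripleo_json`` and the
``TRIPLEO_ANSIBLE_LOG_DIR`` variable are hypothetical placeholders used for
illustration only, not names defined by this spec or by validations-libs.

```python
import os


def build_callback_env(log_dir, stdout_callback="tripleo_json", base_env=None):
    """Return a copy of the environment with the callback settings added.

    ``tripleo_json`` and ``TRIPLEO_ANSIBLE_LOG_DIR`` are hypothetical names.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Standard Ansible setting: which callback plugin drives the console output.
    env["ANSIBLE_STDOUT_CALLBACK"] = stdout_callback
    # Hypothetical variable the callback would read to locate its log directory.
    env["TRIPLEO_ANSIBLE_LOG_DIR"] = os.path.expanduser(log_dir)
    return env


# Example with an explicit base environment, so the result is predictable.
env = build_callback_env("/tmp/ansible-logs", base_env={"PATH": "/usr/bin"})
```

The resulting mapping could then be passed to ansible-runner, for instance
through the ``envvars`` argument of ``ansible_runner.run()``; the actual wiring
inside tripleoclient may well differ.
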
Direct feedback
---------------
The direct feedback will tell the operator which task is currently running
and, when it ends, whether it succeeded.

Using a callback might provide a "human suited" output.

File logging
------------
Here, we must define multiple things, and take into account that we're running
multiple playbooks, with multiple calls to ansible-runner.

File location
.............
Nowadays, most if not all of the deploy related files are located in the
user home directory (i.e. ~/overcloud-deploy/<stack>/).
It therefore sounds reasonable to put the logs in the same location, or in a
subdirectory of it.

Keeping this location also solves the potential access rights issue, since a
standard home directory has a 0700 mode, preventing any other user from
accessing its content.

We might even go a bit further and enforce a 0600 mode on the log files, just
to be sure.

Remember, logs might include sensitive data, especially when we're running
with extra debugging.

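The permission handling above could look like the following sketch. It assumes
a POSIX system; the directory and file names are illustrative only, and the
helper ``open_private_log`` is a hypothetical name, not an existing
tripleoclient function.

```python
import os
import tempfile


def open_private_log(log_dir, filename):
    """Create log_dir with mode 0700 and open filename in it with mode 0600."""
    os.makedirs(log_dir, mode=0o700, exist_ok=True)
    # makedirs honours the umask and leaves an existing directory untouched,
    # so the mode is enforced explicitly.
    os.chmod(log_dir, 0o700)
    path = os.path.join(log_dir, filename)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    # Enforce 0600 even if the file already existed with wider permissions.
    os.fchmod(fd, 0o600)
    return os.fdopen(fd, "a", encoding="utf-8")


# Demonstration in a throw-away directory.
demo_dir = os.path.join(tempfile.mkdtemp(), "ansible-logs")
with open_private_log(demo_dir, "run.json") as handle:
    handle.write("{}\n")
```

Setting the modes explicitly, rather than relying on the umask, keeps the
result predictable whatever the operator's shell configuration is.
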
File format convention
......................
In order to make the logs easily usable by automated tools, and since we
already heavily rely on JSON, the log output should be formatted as JSON. This
would allow us to add some new CLI commands such as "history list", "history
show" and so on.

Also, since JSON is well supported by logging services such as ElasticSearch,
using it makes sending the logs to a central logging service easy and
convenient.

While JSON is nice, it will more than probably prevent a straight read by the
operator - but with a working CLI, we might get something closer to what we
have in the `Validation Framework`_, for instance (see `this example`_). We
might even consider a CLI that converts JSON to whatever the operator might
want, including but not limited to XML, plain text or JUnit (Jenkins).

There should be a new parameter to switch the format between "plain" and
"json" - the default value is still subject to discussion, but providing this
parameter will ensure operators can choose whichever format they want. The
consensus so far seems to be "default to plain".

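To illustrate the JSON-to-plain conversion mentioned above, here is a small
sketch. The JSON layout (``playbook``, ``tasks``, ``host``, ``status``,
``name``) is an assumption made up for this example; the real schema would be
whatever the callback ends up writing.

```python
import json


def render_plain(json_text):
    """Render a JSON log (hypothetical schema) as plain text lines."""
    record = json.loads(json_text)
    lines = ["PLAYBOOK: {}".format(record["playbook"])]
    for task in record["tasks"]:
        # One line per task result: host, status and task name.
        lines.append("{host} | {status} | {name}".format(**task))
    return "\n".join(lines)


# Sample log content, following the assumed schema.
sample = json.dumps({
    "playbook": "deploy_steps_playbook.yaml",
    "tasks": [
        {"host": "controller-0", "status": "ok", "name": "Gather facts"},
        {"host": "compute-0", "status": "failed", "name": "Start containers"},
    ],
})
plain = render_plain(sample)
```

A "history show" style command could rely on such a renderer, while the file
on disk stays machine-readable.
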
Filename convention
...................
As said, we're running multiple playbooks during the actions, and we also want
to have some kind of history.

In order to do that, the easiest way to get a name is to concatenate the time
and the playbook name, something like:

* *timestamp*-*playbookname*.json

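The naming scheme above can be sketched as follows. The exact timestamp format
is not fixed by this spec; the compact UTC form used here is only an
assumption, chosen because it sorts chronologically, and
``build_log_filename`` is a hypothetical helper name.

```python
import os
from datetime import datetime, timezone


def build_log_filename(playbook_path, when=None):
    """Return '<timestamp>-<playbookname>.json' for the given playbook.

    ``when`` must be a timezone-aware datetime; it defaults to now (UTC).
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    timestamp = when.strftime("%Y%m%dT%H%M%SZ")
    # Keep only the playbook file name, without directory or extension.
    playbook = os.path.splitext(os.path.basename(playbook_path))[0]
    return "{}-{}.json".format(timestamp, playbook)


example = build_log_filename(
    "/some/dir/deploy_steps_playbook.yaml",
    datetime(2021, 5, 4, 12, 30, 0, tzinfo=timezone.utc),
)
```

With a sortable timestamp as prefix, a plain directory listing already gives
the history in chronological order.
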
Use systemd/journald instead of files
.....................................
One might want to use systemd/journald instead of plain files. While this
sounds appealing, there are multiple potential issues:

#. Sensitive data will be shown in the system's journald, within reach of any
   other user
#. Journald has rate limits and thresholds, meaning we might hit them, and
   therefore lose logs, or prevent other services from using journald for
   their own logging
#. While we can configure a log service (rsyslog, syslog-ng, etc) in order to
   output specific content to specific files, we will face access issues on
   them

Therefore, we shouldn't use journald.

Does it meet the requirements?
------------------------------
* No service addition: yes - it's only a change in the CLI, no new dependency
  is needed (tripleoclient already depends on validations-common, which
  depends on validations-libs)
* No increase in operation time: this has to be proven with a proper PoC and
  metrics gathering/comparison.
* Existing tool: yes
* Actively maintained: so far, yes - expected to be extended outside of TripleO
* KISS: yes, based on the validations-libs and a simple Ansible callback

Alternatives
============

ARA
---
`ARA Records Ansible`_ provides some of the functionalities we implemented in
the Validation Framework logging, but it lacks some of the wanted features,
such as

* CLI integration within tripleoclient
* Third-party service independence
* plain file logging in order to scrape them with SOSReport or other tools

ARA needs a DB backend - we could inject results in the existing galera DB, but
that might create some issues with the concurrent accesses happening during a
deploy for instance. Using sqlite is also an option, but it means new packages,
a new file location to save, a binary format and so on.

It also needs a web server in order to show the reporting, meaning yet
another httpd configuration, and the need to access it on the undercloud.

Also, ARA being a whole service, we would have to deploy it, configure it
and maintain it - plus ensure it is properly running before each action so
that it gets the logs.

By default, ARA doesn't affect the actual playbook output, while this is
mostly what this spec is about: provide concise feedback to the operator, while
keeping the logs on disk, in files, with the ability to interact with them
through the CLI directly.

In the end, ARA might be a solution, but it will require more work to get it
integrated and, since the TripleO UI has been deprecated, there isn't any real
way to integrate it in an existing UI tool.

Would it meet the requirements?
...............................
* No service addition: no, due to the "REST API" aspect. A service must answer
  API calls
* No increase in operation time: probably yes, depending on the way ARA
  manages its input queues. Since it's also using a callback, we have to
  account for the potential resources used by it.
* Existing tool: yes
* Actively maintained: yes
* KISS: yes, but it adds new dependencies (DB backend, Web server, ARA service,
  and so on)

Note on the "new dependencies": while ARA can be launched
`without any service`_, it seems to be only for development purposes, according
to the informative note we can read on the documentation page::

  Good for small scale usage but inefficient and contains a lot of small files
  at a large scale.

Therefore, we shouldn't use ARA.

Proposed Roadmap
================
In Xena:

* Ensure we have all the API capabilities within validations-libs in order to
  set needed/wanted parameters for a different log location and file naming
* Start to work on the ansible-runner calls so that they use a tweaked
  callback, using the validations-libs capabilities in order to get the direct
  feedback as well as the formatted file in the right location

Security Impact
===============
As we're going to store the full ansible output on disk, we must ensure the log
location is closed to any unwanted user. As stated while talking about the
file location, the directory mode and ownership must be set so that only the
needed users can access its content (root + stack user).

Once this is sorted out, no other security impact is to be expected.
Furthermore, it will even make things more secure than now, since the current
way ansible is launched within tripleoclient puts an "ansible.log" file in the
operator home directory without any specific rights.

Upgrade Impact
==============
Apart from ensuring the log location exists, there isn't any major upgrade
impact. A doc update must be done in order to point to the log location, as
well as an update of some messages within the CLI.

End User Impact
===============
There are two impacts to the End User:

* CLI output will be reworked in order to provide useful information (see
  Direct Feedback above)
* Log location will change a bit for the ansible part (see File Logging above)

Performance Impact
==================
A limited impact is to be expected - but a proper PoC with metrics must be
conducted to assess the actual change.

Multiple deploys must be done, with different Overcloud designs, in order to
see how the impact evolves with the number of nodes.

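As a rough idea of how the PoC metrics could be gathered, the sketch below
measures wall time, CPU time and peak memory of a command, so that runs with
and without the callback can be compared. It assumes a Unix system (the
``resource`` module is not available elsewhere); ``measure`` is a hypothetical
helper, and the command run here is a trivial placeholder standing in for an
actual deploy.

```python
import resource
import subprocess
import sys
import time


def measure(command):
    """Run command and return its return code, wall time, CPU time and peak RSS."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    completed = subprocess.run(command, check=False)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "returncode": completed.returncode,
        "wall_seconds": wall,
        # User + system CPU time consumed by the child process.
        "cpu_seconds": (after.ru_utime - before.ru_utime)
        + (after.ru_stime - before.ru_stime),
        # High-water mark over all waited-for children so far
        # (kibibytes on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
    }


# Placeholder command; a real PoC would wrap the deploy command instead.
metrics = measure([sys.executable, "-c", "pass"])
```

Since ``ru_maxrss`` is a high-water mark for all children of the measuring
process, each scenario is best measured from a fresh process.
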
Deployer Impact
===============
Same as End User Impact: CLI output will be changed, and the log location will
be updated.

Developer Impact
================
The callback is enabled by default, but the Developer might want to disable it.
Proper doc should reflect this. No real impact in the end.

Implementation
==============
Contributors
------------
* Cédric Jeanneret
* Mathieu Bultel

Work Items
----------
* Modify validations-libs in order to provide the needed interface (shouldn't
  be really needed, the libs are already modular and should expose the wanted
  interfaces and parameters)
* Create a new callback in tripleo-ansible
* Ensure the log directory is created with the correct rights
* Update the ansible-runner calls to enable the callback by default
* Ensure tripleoclient outputs status updates on a regular basis while the logs
  are being written in the right location
* Update/create the needed documentation about the new logging location and
  management

.. _ephemeral heat: https://specs.openstack.org/openstack/tripleo-specs/specs/wallaby/ephemeral-heat-overcloud.html
.. _Validation Framework: https://specs.openstack.org/openstack/tripleo-specs/specs/stein/validation-framework.html
.. _this example: https://asciinema.org/a/283645
.. _python-validations-libs: https://opendev.org/openstack/validations-libs
.. _ARA Records Ansible: https://ara.recordsansible.org/
.. _without any service: https://ara.readthedocs.io/en/latest/cli.html#ara-manage-generate
.. _ansible "acl": https://docs.ansible.com/ansible/latest/modules/acl_module.html