Merge "Improve ansible logging within tripleoclient"

2021-07-01 15:22:12 +00:00 · 2021-07-01 15:22:12 +00:00 · df4316d0ba
commit df4316d0ba
parent 5e386c558f f03d1b3c42
1 changed files with 304 additions and 0 deletions
--- a/specs/xena/ansible-logging-tripleoclient.rst
+++ b/specs/xena/ansible-logging-tripleoclient.rst
@@ -0,0 +1,304 @@
 ..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.
 http://creativecommons.org/licenses/by/3.0/legalcode
 ==================================================
 Improve logging for ansible calls in tripleoclient
 ==================================================
 Launchpad blueprint:
 https://blueprints.launchpad.net/tripleo/+spec/ansible-logging-tripleoclient
 Problem description
 ===================
 Currently, the Ansible playbook logging shown during a deploy or day-2
 operations such as upgrade, update or scaling is either too verbose or not
 verbose enough.
 Furthermore, since we're moving to ephemeral services on the Undercloud (see
 `ephemeral heat`_ for instance), getting information about the state, content
 and related things is a bit less intuitive. Proper logging, with an associated
 CLI, can really improve that situation and provide a better user experience.
 Requirements for the solution
 =============================
 No new service addition
 -----------------------
 We are already trying to remove services from the Undercloud, such as Mistral;
 adding new ones would defeat that purpose.
 No increase in deployment and day-2 operations time
 ---------------------------------------------------
 The solution must not increase the time taken for deploy, update, upgrades,
 scaling and any other day-2 operations. It must be 100% transparent to the
 operator.
 Use existing tools
 ------------------
 In the same way we don't want to have new services, we don't want to reinvent
 the wheel once more, and we must check the already huge catalog of existing
 solutions.
 KISS
 ----
 Keep It Simple Stupid is a key element - code must be easy to understand and
 maintain.
 Proposed Change
 ===============
 Introduction
 ------------
 While working on the `Validation Framework`_, a large part of the effort went
 into logging. There, we found a way to get machine-readable output and store it
 in a defined location, which allowed us to provide a convenient interface to
 list and show validation runs.
 This heavily relies on an ansible callback plugin with specific libs, which are
 shipped in the `python-validations-libs`_ package.
 Since the approach is modular, those libs can be re-used pretty easily in other
 projects.
 In addition, python-tripleoclient already depends on `python-validations-libs`_
 (via a dependency on validations-common), meaning we already have the needed
 bits.
 The Idea
 --------
 Since we have the mandatory code already present on the system (provided by the
 new `python-validations-libs`_ package), we can modify how ansible-runner is
 configured in order to inject a callback, and get the output we need in both
 the shell (direct feedback to the operator) and in a dedicated file.
 Since callbacks aren't free (though hopefully not expensive either), a proper
 PoC must be conducted in order to gather metrics about CPU, RAM and time. Please
 see the Performance Impact section.
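A minimal sketch of the idea, assuming the callback is selected through
Ansible's standard environment settings. ``ANSIBLE_STDOUT_CALLBACK`` and
``ANSIBLE_CALLBACK_PLUGINS`` are regular Ansible settings; the callback name
``example_json``, the variable ``EXAMPLE_ANSIBLE_LOG_DIR`` and the helper
``build_runner_envvars`` are hypothetical and only illustrate the shape of the
change.

```python
import os


def build_runner_envvars(log_dir, callback="example_json", plugin_dir=None):
    """Return environment variables that would enable a logging callback.

    'example_json' is a placeholder callback name, not an existing plugin.
    """
    envvars = {
        # Standard Ansible setting: the callback that renders stdout.
        "ANSIBLE_STDOUT_CALLBACK": callback,
        # Hypothetical variable the callback could read to locate its log
        # directory; the real name depends on the callback implementation.
        "EXAMPLE_ANSIBLE_LOG_DIR": os.path.expanduser(log_dir),
    }
    if plugin_dir:
        # Standard Ansible setting: extra search path for callback plugins.
        envvars["ANSIBLE_CALLBACK_PLUGINS"] = plugin_dir
    return envvars
```

The resulting dict could then be handed to ansible-runner through its
``envvars`` argument, leaving the playbooks themselves untouched.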
 Direct feedback
 ---------------
 The direct feedback will tell the operator about the current task being done
 and, when it ends, if it's a success or not.
 Using a callback might provide a "human suited" output.
 File logging
 ------------
 Here, we must define multiple things, and take into account we're running
 multiple playbooks, with multiple calls to ansible-runner.
 File location
 .............
 Nowadays, most if not all of the deploy related files are located in the
 user home directory (i.e. ~/overcloud-deploy/<stack>/).
 It therefore sounds reasonable to get the log in the same location, or a
 subdirectory in that location.
 Keeping this location also solves the potential access right issue, since a
 standard home directory has a 0700 mode, preventing any other user from
 accessing its content.
 We might even go a bit further and enforce a 0600 mode on the log files, just
 to be sure. Remember, logs might include sensitive data, especially when we're
 running with extra debugging.
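As an illustration of the access rights discussed above, here is a minimal
sketch that creates a private log directory (0700) and opens a log file readable
and writable by its owner only (0600). The helper names are hypothetical and the
code assumes a POSIX system.

```python
import os


def ensure_private_log_dir(path):
    """Create the log directory if needed and restrict it to its owner."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs honours the umask and leaves an existing directory untouched,
    # so the mode is enforced explicitly.
    os.chmod(path, 0o700)
    return path


def open_private_log(path):
    """Open a log file for appending with mode 0600."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    # Enforce the mode even if the file already existed with wider rights.
    os.fchmod(fd, 0o600)
    return os.fdopen(fd, "a")
```

Setting the mode at creation time, rather than after the first write, avoids a
window during which the file would be readable by other users.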
 File format convention
 ......................
 In order to make the logs easily usable by automated tools, and since we
 already heavily rely on JSON, the log output should be formatted as JSON. This
 would allow adding new CLI commands such as "history list", "history show"
 and so on.
 Also, since JSON is well supported by logging services such as Elasticsearch,
 using it makes sending logs to a central logging service easy and convenient.
 While JSON is convenient for tools, it will most likely prevent the operator
 from reading the logs directly - but with a working CLI, we might get something
 close to what we have in the `Validation Framework`_, for instance (see
 `this example`_). We might even consider a CLI that converts JSON to whatever
 the operator might want, including but not limited to XML, plain text or JUnit
 (Jenkins).
 There should be a new parameter to switch the format between "plain" and
 "json" - the default value is still subject to discussion, but providing this
 parameter ensures operators can choose the format they need regardless of the
 default. The current consensus seems to be "default to plain".
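To give an idea of what "history list" and "history show" could build on, here
is a minimal sketch that works on a directory of JSON log files. The function
names are hypothetical, and no particular JSON schema is assumed: the content is
simply parsed and returned.

```python
import json
from pathlib import Path


def history_list(log_dir):
    """Return the names of the JSON log files in log_dir, sorted by name."""
    return sorted(p.name for p in Path(log_dir).glob("*.json"))


def history_show(log_dir, name):
    """Load one log file and return its parsed JSON content."""
    with open(Path(log_dir) / name, encoding="utf-8") as handle:
        return json.load(handle)
```

A real command would add filtering and a human-friendly rendering on top of
these two primitives.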
 Filename convention
 ...................
 As said, we're running multiple playbooks during the actions, and we also want
 to have some kind of history.
 In order to do that, the easiest way to get a name is to concatenate the time
 and the playbook name, something like:
 * *timestamp*-*playbookname*.json
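A minimal sketch of this naming scheme follows. The exact timestamp layout (UTC,
with dashes instead of colons so the name stays filesystem-friendly) and the
helper name ``log_filename`` are assumptions made for the example.

```python
import os
from datetime import datetime, timezone


def log_filename(playbook, when=None):
    """Build '<timestamp>-<playbookname>.json' for a playbook path."""
    # Normalize to UTC so names sort chronologically across time zones.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H-%M-%SZ")
    # Keep only the playbook base name, without directory or extension.
    name = os.path.splitext(os.path.basename(playbook))[0]
    return f"{stamp}-{name}.json"
```

With the timestamp first, a plain alphabetical listing of the directory is also
a chronological history of the runs.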
 Use systemd/journald instead of files
 .....................................
 One might want to use systemd/journald instead of plain files. While this
 sounds appealing, there are multiple potential issues:
 #. Sensitive data will be shown in the system's journald, within reach of any
   other user
 #. Journald has rate limits and thresholds, meaning we might hit them, and
   therefore lose logs, or prevent other services from using journald for their
   own logging
 #. While we can configure a log service (rsyslog, syslog-ng, etc) in order to
   output specific content to specific files, we will face access issues on
   them
 Therefore, we shouldn't use journald.
 Does it meet the requirements?
 ------------------------------
 * No service addition: yes - it's only a change in the CLI, no new dependency is
  needed (tripleoclient already depends on validations-common, which depends on
  validations-libs)
 * No increase in operation time: this has to be proven with proper PoC and
  metrics gathering/comparison.
 * Existing Tool: yes
 * Actively maintained: so far, yes - expected to be extended outside of TripleO
 * KISS: yes, based on the validations-libs and simple Ansible callback
 Alternatives
 ============
 ARA
 ---
 `ARA Records Ansible`_ provides some of the functionalities we implemented in
 the Validation Framework logging, but it lacks some of the wanted features,
 such as
 * CLI integration within tripleoclient
 * Independence from third-party services
 * plain file logging in order to scrape them with SOSReport or other tools
 ARA needs a DB backend - we could inject results in the existing galera DB, but
 that might create some issues with the concurrent accesses happening during a
 deploy for instance. Using sqlite is also an option, but it means new packages,
 new file location to save, binary format and so on.
 It also needs a web server in order to show the reporting, meaning yet
 another httpd configuration, and the need to access it on the undercloud.
 Also, ARA being a whole service, we would have to deploy, configure and
 maintain it - plus ensure it is properly running before each action so that
 it gets the logs.
 By default, ARA doesn't affect the actual playbook output, while that output is
 the main goal of this spec: provide concise feedback to the operator, while
 keeping the logs on disk, in files, with the ability to interact with them
 through the CLI directly.
 In the end, ARA might be a solution, but it will require more work to get it
 integrated, and, since the TripleO UI has been deprecated, there is no real way
 to integrate it in an existing UI tool.
 Would it meet the requirements?
 ...............................
 * No service addition: no, due to the "REST API" aspect. A service must answer
  API calls
 * No increase in operation time: probably yes, depending on the way ARA
  manages its input queues. Since it's also using a callback, we have to account
  for the potential resources used by it.
 * Existing tool: yes
 * Actively maintained: yes
 * KISS: yes, but it adds new dependencies (DB backend, Web server, ARA service,
  and so on)
 Note on the "new dependencies": while ARA can be launched
 `without any service`_, it seems to be only for development purposes, according
 to the informative note we can read on the documentation page::
  Good for small scale usage but inefficient and contains a lot of small files
  at a large scale.
 Therefore, we shouldn't use ARA.
 Proposed Roadmap
 ================
 In Xena:
 * Ensure we have all the API capabilities within validations-libs in order to
  set needed/wanted parameters for a different log location and file naming
 * Start to work on the ansible-runner calls so that it uses a tweaked callback,
  using the validations-libs capabilities in order to get the direct feedback
  as well as the formatted file in the right location
 Security Impact
 ===============
 As we're going to store full ansible output on the disk, we must ensure access
 to the log location is denied to any unauthorized user. As stated while talking
 about the file location, the directory mode and ownership must be set so that
 only the needed users can access its content (root + stack user).
 Once this is sorted out, no other security impact is to be expected.
 Furthermore, it will even make things more secure than now, since the current
 way ansible is launched within tripleoclient puts an "ansible.log" file in the
 operator home directory without any specific rights.
 Upgrade Impact
 ==============
 Apart from ensuring the log location exists, there isn't any major upgrade
 impact. A doc update must be done in order to point to the log location, as
 well as some messages within the CLI.
 End User Impact
 ===============
 There are two impacts to the End User:
 * CLI output will be reworked in order to provide useful information (see
  Direct Feedback above)
 * Log location will change a bit for the ansible part (see File Logging above)
 Performance Impact
 ==================
 A limited impact is to be expected - but a proper PoC with metrics must be
 conducted to assess the actual change.
 Multiple deploys must be done, with different Overcloud designs, in order to
 see how the actual impact evolves with the number of nodes.
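One possible way to gather such metrics is to wrap a command and record its
wall-clock time, CPU time and peak memory, then compare runs with and without
the callback. The helper name ``measure`` is hypothetical, and the sketch relies
on the POSIX-only ``resource`` module.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return wall time, CPU time and peak RSS of child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "wall_seconds": wall,
        # User + system CPU time consumed by the command.
        "cpu_seconds": (after.ru_utime - before.ru_utime)
        + (after.ru_stime - before.ru_stime),
        # High-water mark over all children waited for so far
        # (kilobytes on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
    }


if __name__ == "__main__":
    # Trivial stand-in command; a real PoC would wrap the deploy command.
    print(measure([sys.executable, "-c", "pass"]))
```

Because ``max_rss`` is a high-water mark for the measuring process, each
scenario should be measured from a fresh interpreter to keep the numbers
comparable.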
 Deployer Impact
 ===============
 Same as End User Impact: CLI output will be changed, and the log location will
 be updated.
 Developer Impact
 ================
 The callback is enabled by default, but the Developer might want to disable it.
 Proper doc should reflect this. No real impact in the end.
 Implementation
 ==============
 Contributors
 ------------
 * Cédric Jeanneret
 * Mathieu Bultel
 Work Items
 ----------
 * Modify validations-libs in order to provide the needed interface (shouldn't
  be really needed, the libs are already modular and should expose the wanted
  interfaces and parameters)
 * Create a new callback in tripleo-ansible
 * Ensure the log directory is created with the correct rights
 * Update the ansible-runner calls to enable the callback by default
 * Ensure tripleoclient outputs status updates on a regular basis while the logs
  are being written in the right location
 * Update/create the needed documentation about the new logging location and
  management
 .. _ephemeral heat: https://specs.openstack.org/openstack/tripleo-specs/specs/wallaby/ephemeral-heat-overcloud.html
 .. _Validation Framework: https://specs.openstack.org/openstack/tripleo-specs/specs/stein/validation-framework.html
 .. _this example: https://asciinema.org/a/283645
 .. _python-validations-libs: https://opendev.org/openstack/validations-libs
 .. _ARA Records Ansible: https://ara.recordsansible.org/
 .. _without any service: https://ara.readthedocs.io/en/latest/cli.html#ara-manage-generate
 .. _ansible "acl": https://docs.ansible.com/ansible/latest/modules/acl_module.html