..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
Collect system logs from IPA
============================

https://bugs.launchpad.net/ironic/+bug/1587143

This spec adds support for retrieving the deployment system logs from IPA.

Problem description
===================

We currently have no mechanism to automatically retrieve the system
logs from the IPA deploy ramdisk. Having access to the logs may be
very useful, especially when troubleshooting a deployment failure.

Currently, there are a few ways to get access to the logs in the ramdisk,
but they are manual, and sometimes it is not desirable to enable them
in production. The following points describe two of them:

#. Have a console session enabled for the node being deployed.

   While this works, it's tricky because the operator needs to figure
   out which node was picked by the scheduler and enable the console
   for it. Also, not all drivers have console support.

#. Disable powering off a node upon a deployment failure.

   Operators could `disable powering off a node upon a deployment
   failure <https://review.openstack.org/#/c/259119>`_ but this has
   some implications:

   a. It does not work in conjunction with Nova. When the instance
      fails to be provisioned, Nova will invoke destroy() and the Ironic
      virt driver will then force a power off on that node.

   b. Leaving the nodes powered on after the failure is not desirable
      in some deployments.

Proposed change
===============

The proposed implementation consists of having Ironic retrieve the
system logs from the deploy ramdisk (IPA) via its API and then upload
them to Swift or save them on the local file system of that conductor (for
standalone-mode users).

Changes in IPA
--------------

A new ``log`` extension will be added to IPA. This extension will
introduce a new synchronous command called ``collect_system_logs``. When
this command is invoked, IPA will tar, gzip and base64 encode the
system logs and return the resulting string to the caller.

Since we support different base OSs for IPA (e.g. Tiny Core Linux,
Fedora, Debian) we need different ways to find the logs depending on the
system. This spec proposes two approaches that should cover most of the
distributions in use today:

#. For distributions using ``systemd``, all system logs are available via
   ``journald``. IPA will invoke the ``journalctl`` command and
   get the logs from there.

#. For other distributions, this spec proposes collecting all the logs
   from */var/log* and the output of the ``dmesg`` command.

The logs from all distributions, independent of the init system, will
also contain the output of the following commands: ``ps``, ``df``,
and ``iptables``. A sketch of the collection logic is included below.
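
The following is a minimal, standalone sketch of what the
``collect_system_logs`` command could do. It is illustrative only: the
helper name and the exact commands and flags used (e.g. ``ps aux``,
``iptables -L -n -v``) are assumptions, not the final IPA code.

.. code-block:: python

   import base64
   import io
   import subprocess
   import tarfile


   def _run(*cmd):
       """Best-effort helper: return a command's output or None on failure."""
       try:
           return subprocess.check_output(cmd)
       except (OSError, subprocess.CalledProcessError):
           return None


   def collect_system_logs():
       """Tar, gzip and base64 encode the system logs, return a string."""
       # Command output collected on every distribution.
       outputs = {
           'ps': _run('ps', 'aux'),
           'df': _run('df', '-h'),
           'iptables': _run('iptables', '-L', '-n', '-v'),
       }

       # On systemd-based ramdisks the journal holds all system logs;
       # otherwise fall back to /var/log plus the dmesg output.
       journal = _run('journalctl', '--no-pager')
       if journal is not None:
           outputs['journal'] = journal
       else:
           outputs['dmesg'] = _run('dmesg')

       with io.BytesIO() as fp:
           with tarfile.open(fileobj=fp, mode='w:gz') as tar:
               for name, data in outputs.items():
                   if data is None:
                       continue
                   info = tarfile.TarInfo(name='%s.txt' % name)
                   info.size = len(data)
                   tar.addfile(info, io.BytesIO(data))
               if journal is None:
                   tar.add('/var/log')
           return base64.b64encode(fp.getvalue()).decode()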

Changes in Ironic
-----------------

New configuration options will be added to Ironic under the ``[agent]``
group (an example configuration is shown below):

#. ``deploy_logs_collect`` (string): Whether Ironic should collect the
   deployment logs or not. Valid options are: "always", "on_failure" or
   "never". Defaults to "on_failure".

#. ``deploy_logs_storage_backend`` (string): The name of the storage
   backend where the logs will be stored. One of: "local" or "swift".
   Defaults to "local".

#. ``deploy_logs_local_path`` (string): The path to the directory
   where the logs should be stored, used when
   ``deploy_logs_storage_backend`` is configured to **local**. Defaults to
   ``/var/log/ironic/deploy``.

#. ``deploy_logs_swift_container`` (string): The name of the Swift
   container to store the logs, used when
   ``deploy_logs_storage_backend`` is configured to **swift**. Defaults
   to *ironic_deploy_logs_container*.

#. ``deploy_logs_swift_days_to_expire`` (integer): Number of days before
   a log object is `marked as expired in Swift
   <http://docs.openstack.org/developer/swift/overview_expiring_objects.html>`_.
   If None, the logs will be kept forever or until manually deleted. Used
   when ``deploy_logs_storage_backend`` is configured to **swift**.
   Defaults to *30 days*.

.. note::
   When storing the logs in the local file system, Ironic won't be
   responsible for deleting the logs after a certain time. It's up to
   the operator to configure an external job to do so, if desired.
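
As an illustration, an operator who wants to keep the logs of failed
deployments in Swift for 30 days could set the new options in
``ironic.conf`` along these lines (the values below are only an example):

.. code-block:: ini

   [agent]
   deploy_logs_collect = on_failure
   deploy_logs_storage_backend = swift
   deploy_logs_swift_container = ironic_deploy_logs_container
   deploy_logs_swift_days_to_expire = 30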

Depending on the value of ``deploy_logs_collect``, Ironic will
invoke ``log.collect_system_logs`` as part of the deployment of the
node (right before powering it off or rebooting). For example, if
``deploy_logs_collect`` is set to **always**, Ironic will collect the logs
regardless of whether the deployment succeeds or fails; if it is set
to **on_failure**, Ironic will collect the logs only upon a deployment
failure; if it is set to **never**, Ironic will never collect the
deployment logs.

When the logs are collected, Ironic should decode the base64 encoded
tar.gz file and store it according to the ``deploy_logs_storage_backend``
configuration. All log objects will be named with the following pattern:
*<node-uuid>[_<instance-uuid>]_<timestamp yyyy-mm-dd-hh:mm:ss>.tar.gz*. Note
that ``instance_uuid`` is not a required field for deploying a node when
Ironic is configured to be used in **standalone** mode, so it will only be
appended to the name when present.
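
For illustration, the conductor-side handling could look roughly like the
sketch below. The function name and signature are hypothetical, and only
the **local** backend is shown; the **swift** backend would upload an
object with the same name to the configured container instead.

.. code-block:: python

   import base64
   import datetime
   import os


   def store_deploy_logs(node_uuid, encoded_logs, instance_uuid=None,
                         local_path='/var/log/ironic/deploy'):
       """Decode the logs returned by IPA and store them on disk."""
       timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d-%H:%M:%S')
       parts = [node_uuid]
       if instance_uuid:  # only present when deploying through Nova
           parts.append(instance_uuid)
       parts.append(timestamp)
       object_name = '%s.tar.gz' % '_'.join(parts)

       if not os.path.exists(local_path):
           os.makedirs(local_path)
       with open(os.path.join(local_path, object_name), 'wb') as fp:
           fp.write(base64.b64decode(encoded_logs))
       return object_name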

When using Swift, operators can associate the objects in the container
with the nodes in Ironic and search for the logs of a specific node
using the ``prefix`` parameter, for example:

.. code-block:: bash

   $ swift list ironic_deploy_logs_container -p 5e9258c4-cfda-40b6-86e2-e192f523d668
   5e9258c4-cfda-40b6-86e2-e192f523d668_0c1e1a65-6af0-4cb7-a16e-8f9a45144b47_2016-05-31_22:05:59
   5e9258c4-cfda-40b6-86e2-e192f523d668_db87f2c5-7a9a-48c2-9a76-604287257c1b_2016-05-31_22:07:25

.. note::
   This implementation requires the network to be set up correctly,
   otherwise Ironic will not be able to contact the IPA API. When
   debugging such problems, the only possible action is to look at the
   consoles of the nodes to see some logs. This method has some caveats:
   see the `Problem description`_ section for more information.

.. note::
   Neither Ironic nor IPA will be responsible for sanitizing any logs
   before storing them. First, this spec is limited to collecting
   logs from the deployment only, and at this point the tenant won't have
   used the node yet. Second, the services generating the logs should be
   responsible for masking secrets in their logs (like we do in Ironic);
   if they don't, it should be considered a bug in those services.

Alternatives
------------

Since we already provide manual ways of doing this, via accessing the
console or disabling the power off of nodes upon a deployment failure,
there are few alternatives left for this work.

The current proposed solution could be extended to fit more use cases
beyond what this spec proposes. For example, instead of uploading the
logs to Swift or storing them in the local file system, Ironic could
upload them to an HTTP/FTP server.

As briefly described in `Changes in IPA`_, the method to collect the logs
could be extended to include more logs and the output of different commands
that are useful for troubleshooting.

Data model impact
-----------------

None

State Machine Impact
--------------------

None

REST API impact
---------------

None

Client (CLI) impact
-------------------

None

RPC API impact
--------------

None

Driver API impact
-----------------

None

Nova driver impact
------------------

None

Ramdisk impact
--------------

None

Security impact
---------------

None.

As a note, credentials **are not** passed from Ironic to the deploy
ramdisk. The ``ironic-conductor`` service, which already holds the Swift
credentials, is the one responsible for uploading the logs to Swift.

Other end user impact
---------------------

None

Scalability impact
------------------

None

Performance Impact
------------------

The node will stay a little longer in the ``deploying`` provision state
while IPA is collecting the logs, if enabled.

Other deployer impact
---------------------

None

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
    lucasagomes <lucasagomes@gmail.com>

Other contributors:

Work Items
----------

* Add the new ``log`` extension and ``collect_system_logs`` method in IPA.

* Add the new configuration options described in the `Changes in Ironic`_
  section.

* Invoke the new ``log.collect_system_logs`` method in IPA as part
  of the deployment and store the response file according to the
  ``deploy_logs_storage_backend`` configuration option (if enabled).

Dependencies
============

None

Testing
=======

Unit tests will be added.

Upgrades and Backwards Compatibility
====================================

None.

As a note, when using an old IPA ramdisk which does not support the new
``log.collect_system_logs`` command, Ironic should handle the resulting
error and log a warning message for the operator if ``deploy_logs_collect``
is set to **always** or **on_failure**.
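
A rough sketch of such handling is shown below. The exception type and the
``collect_system_logs`` call on the agent client are placeholders for
whatever the implementation ends up using, not an existing Ironic API.

.. code-block:: python

   import logging

   LOG = logging.getLogger(__name__)


   class AgentCommandNotFound(Exception):
       """Placeholder for the error returned for an unknown IPA command."""


   def collect_logs_if_supported(agent_client, node):
       """Collect deploy logs, tolerating ramdisks without the new command."""
       try:
           return agent_client.collect_system_logs(node)
       except AgentCommandNotFound:
           LOG.warning('The IPA ramdisk used by node %s does not support '
                       'log.collect_system_logs; deploy logs will not be '
                       'collected.', node.uuid)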

Documentation Impact
====================

Documentation will be provided about how to configure Ironic to collect
the system logs from the deploy ramdisk.

References
==========

None.