Collect system logs from IPA
https://bugs.launchpad.net/ironic/+bug/1587143
This spec adds support for retrieving the deployment system logs from IPA.
Problem description
We currently have no mechanism to automatically retrieve the system logs from the IPA deploy ramdisk. Having access to the logs may be very useful, especially when troubleshooting a deployment failure. Currently, there are a few ways to get access to the logs in the ramdisk, but they are manual, and sometimes it is not desirable to enable them in production. The following points describe two of them:
Have a console session enabled for the node being deployed.
While this works, it's tricky because the operator needs to figure out which node was picked by the scheduler and enable the console for it. Also, not all drivers have console support.
Disable powering off a node upon a deployment failure.
Operators could disable powering off a node upon a deployment failure but this has some implications:
- It does not work in conjunction with Nova. When the instance fails to be provisioned, Nova will invoke destroy() and the Ironic virt driver will then force a power off on that node.
- Leaving the nodes powered on after the failure is not desirable in some deployments.
Proposed change
The proposed implementation consists of having Ironic retrieve the system logs from the deploy ramdisk (IPA) via its API and then upload them to Swift or save them on the local file-system of that conductor (for standalone-mode users).
Changes in IPA
A new log extension will be added to IPA. This extension will introduce a new synchronous command called collect_system_logs. By invoking this command, IPA will tar, gzip and base64 encode the system logs and return the resulting string to the caller.
Since we support different base OSs for IPA (e.g. Tiny Core Linux, Fedora, Debian), we need different ways to find the logs depending on the system. This spec proposes two ways that should be enough for most of the distros today:
- For distributions using systemd, all system logs are available via journald. IPA will then invoke the journalctl command and get the logs from there.
- For other distributions, this spec proposes retaining all the logs from /var/log and the output of the dmesg command.
The logs from all distributions, independent of the init system, will also contain the output of the following commands: ps, df, and iptables.
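As a rough illustration only, the new command could look something like the sketch below. The helper names and the exact command invocations (ps aux, df -h, iptables -L, journalctl --no-pager) are assumptions for the example, not part of this spec:

    # A minimal, illustrative sketch of the new "log" extension command in
    # IPA; helper names and the exact commands invoked are assumptions.
    import base64
    import io
    import shutil
    import subprocess
    import tarfile


    def _add_command_output(tar, name, cmd):
        # Run a command and store its output as a member of the tarball.
        out = subprocess.check_output(cmd)
        info = tarfile.TarInfo(name=name)
        info.size = len(out)
        tar.addfile(info, io.BytesIO(out))


    def collect_system_logs():
        """Tar, gzip and base64 encode the system logs."""
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode='w:gz') as tar:
            # Output collected on every distribution, independent of the
            # init system.
            for name, cmd in (('ps', ['ps', 'aux']),
                              ('df', ['df', '-h']),
                              ('iptables', ['iptables', '-L'])):
                _add_command_output(tar, name, cmd)
            if shutil.which('journalctl'):
                # systemd-based distributions: everything is in journald.
                _add_command_output(tar, 'journal',
                                    ['journalctl', '--no-pager'])
            else:
                # Other distributions: keep /var/log and the dmesg output.
                tar.add('/var/log')
                _add_command_output(tar, 'dmesg', ['dmesg'])
        return base64.b64encode(buf.getvalue()).decode()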
Changes in Ironic
New configuration options will be added to Ironic under the [agent] group:

- deploy_logs_collect (string): Whether Ironic should collect the deployment logs or not. Valid options are: "always", "on_failure" or "never". Defaults to "on_failure".
- deploy_logs_storage_backend (string): The name of the storage backend where the response file will be stored. One of the two: "local" or "swift". Defaults to "local".
- deploy_logs_local_path (string): The path to the directory where the logs should be stored, used when the deploy_logs_storage_backend is configured to local. Defaults to /var/log/ironic/deploy.
- deploy_logs_swift_container (string): The name of the Swift container to store the logs, used when the deploy_logs_storage_backend is configured to swift. Defaults to ironic_deploy_logs_container.
- deploy_logs_swift_days_to_expire (integer): Number of days before a log object is marked as expired in Swift. If None, the logs will be kept forever or until manually deleted. Used when the deploy_logs_storage_backend is configured to swift. Defaults to 30 days.
Note
When storing the logs in the local file-system Ironic won't be responsible for deleting the logs after a certain time. It's up to the operator to configure an external job to do it, if wanted.
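As an example, a standalone deployment that keeps the logs on the conductor could set the options along these lines in ironic.conf (values here are only illustrative):

    [agent]
    deploy_logs_collect = on_failure
    deploy_logs_storage_backend = local
    deploy_logs_local_path = /var/log/ironic/deploy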
Depending on the value of the deploy_logs_collect option, Ironic will invoke log.collect_system_logs as part of the deployment of the node (right before powering it off or rebooting). For example, if deploy_logs_collect is set to always, Ironic will collect the logs independently of the deployment being a success or a failure; if it is set to on_failure, Ironic will collect the logs upon a deployment failure; if it is set to never, Ironic will never collect the deployment logs.
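A small decision helper could capture this behaviour; the function name below is hypothetical, the option values are the ones proposed in this spec:

    # Illustrative decision helper for the behaviour described above.
    def should_collect_deploy_logs(deploy_logs_collect, deploy_failed):
        """Decide whether to ask IPA for the logs before power off/reboot."""
        if deploy_logs_collect == 'always':
            return True
        if deploy_logs_collect == 'on_failure':
            return deploy_failed
        return False  # 'never'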
When the logs are collected, Ironic should decode the base64 encoded tar.gz file and store it according to the deploy_logs_storage_backend configuration. All log objects will be named with the following pattern: <node-uuid>[_<instance-uuid>]_<timestamp yyyy-mm-dd-hh:mm:ss>.tar.gz. Note that instance_uuid is not a required field for deploying a node when Ironic is configured to be used in standalone mode, so it will only be appended to the name if present.
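For the local backend, the decoding and naming could look like the sketch below; the function name and the exact I/O are illustrative, the name pattern is the one proposed above:

    # Illustrative decoding and naming of a log object for the "local"
    # backend; the function name is hypothetical.
    import base64
    import datetime
    import os


    def store_deploy_logs_locally(node_uuid, encoded_logs, instance_uuid=None,
                                  local_path='/var/log/ironic/deploy'):
        timestamp = datetime.datetime.utcnow().strftime('%Y-%m-%d-%H:%M:%S')
        parts = [node_uuid]
        if instance_uuid:  # only present when not in standalone mode
            parts.append(instance_uuid)
        parts.append(timestamp)
        file_name = '_'.join(parts) + '.tar.gz'

        os.makedirs(local_path, exist_ok=True)
        with open(os.path.join(local_path, file_name), 'wb') as fp:
            fp.write(base64.b64decode(encoded_logs))
        return file_name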
When using Swift, operators can associate the objects in the container with the nodes in Ironic and search for the logs of a specific node using the prefix parameter, for example:
$ swift list ironic_deploy_logs_container -p 5e9258c4-cfda-40b6-86e2-e192f523d668
5e9258c4-cfda-40b6-86e2-e192f523d668_0c1e1a65-6af0-4cb7-a16e-8f9a45144b47_2016-05-31_22:05:59
5e9258c4-cfda-40b6-86e2-e192f523d668_db87f2c5-7a9a-48c2-9a76-604287257c1b_2016-05-31_22:07:25
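For the Swift backend, the deploy_logs_swift_days_to_expire option could be honoured by setting the standard X-Delete-After header (expressed in seconds) on upload. A rough sketch, where the helper name is hypothetical and the connection is expected to be a python-swiftclient Connection instance:

    # Illustrative upload of a log object to Swift with an expiration time.
    def upload_deploy_logs_to_swift(connection, container, object_name, data,
                                    days_to_expire=30):
        # 'connection' is assumed to be a swiftclient.client.Connection.
        headers = {}
        if days_to_expire is not None:
            headers['X-Delete-After'] = days_to_expire * 24 * 60 * 60
        connection.put_object(container, object_name, contents=data,
                              headers=headers)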
Note
This implementation requires the network to be set up correctly, otherwise Ironic will not be able to contact the IPA API. When debugging such problems, the only action possible is to look at the consoles of the nodes to see some logs. This method has some caveats: see the Problem description for more information.
Note
Neither Ironic nor IPA will be responsible for sanitizing any logs before storing them. First, this spec is limited to collecting logs from the deployment only, and at this point the tenant won't have used the node yet. Second, the services generating the logs should be responsible for masking secrets in their own logs (like we do in Ironic); if they don't, it should be considered a bug.
Alternatives
Since we already provide ways of doing this, via accessing the console or disabling the powering off of the nodes on failures, there are few alternatives left for this work.
The current proposed solution could be extended to fit more use cases beyond what this spec proposes. For example, instead of uploading the logs to Swift or storing them in the local file-system, Ironic could upload them to an HTTP/FTP server.
As briefly described in Changes in IPA, the method to collect the logs could be extended to include more logs and the output of different commands that are useful for troubleshooting.
Data model impact
None
State Machine Impact
None
REST API impact
None
Client (CLI) impact
None
RPC API impact
None
Driver API impact
None
Nova driver impact
None
Ramdisk impact
None
Security impact
None.
As a note, credentials are not passed from Ironic to the deploy ramdisk. The ironic-conductor service, which already holds the Swift credentials, is the one responsible for uploading the logs to Swift.
Other end user impact
None
Scalability impact
None
Performance Impact
The node will stay a little longer in the deploying provision state while IPA is collecting the logs, if enabled.
Other deployer impact
None
Developer impact
None
Implementation
Assignee(s)
- Primary assignee: lucasagomes <lucasagomes@gmail.com>
- Other contributors:
Work Items
- Add the new log extension and collect_system_logs method in IPA.
- Add the new configuration options described in the Changes in Ironic section.
- Invoke the new log.collect_system_logs method in IPA as part of the deployment and store the response file according to the deploy_logs_storage_backend configuration option (if enabled).
Dependencies
None
Testing
Unit tests will be added.
Upgrades and Backwards Compatibility
None.
As a note, when using an old IPA ramdisk which does not support the new log.collect_system_logs command, Ironic should handle such an exception and log a warning message to the operator if deploy_logs_collect is set to always or on_failure.
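A best-effort collection along these lines could provide that behaviour; the function name, the client call and the broad exception handling below are assumptions for the sake of the example:

    # Illustrative, best-effort collection that tolerates an old ramdisk.
    import logging

    LOG = logging.getLogger(__name__)


    def collect_deploy_logs_safely(client, node):
        try:
            return client.collect_system_logs(node)
        except Exception as exc:
            # An old IPA ramdisk does not know the new command; warn the
            # operator instead of failing the deployment.
            LOG.warning('Could not collect the deployment logs from node %s. '
                        'The ramdisk may not support '
                        'log.collect_system_logs: %s', node, exc)
            return None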
Documentation Impact
Documentation will be provided about how to configure Ironic to collect the system logs from the deploy ramdisk.
References
None.