..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Logging Guidelines
==========================================

https://blueprints.launchpad.net/nova/+spec/log-guidelines

Problem description
===================

The current state of logging both within and between OpenStack
components is inconsistent to the point of being harmful: it obscures
the current state, function, and real cause of errors in an
OpenStack cloud. A consistent, unified logging format will better
enable cloud administrators to monitor and maintain their
environments.

Before we can address this in OpenStack, we first need a set of
guidelines that we can get broad agreement on. This is
expected to happen in waves, and this is the first iteration to gather
agreement on.

Proposed change
===============

Definition of Log Levels
------------------------

http://stackoverflow.com/a/2031209 is a nice writeup about when to use
each log level. Here is a brief description:

- Debug: Shows everything and is likely not suitable for normal
  production operation due to the sheer size of logs generated
- Info: Usually indicates successful service start/stop, versions and
  similar non-error data. This should include largely positive
  units of work that are accomplished (such as starting a compute,
  creating a user, deleting a volume, etc.)
- Audit: REMOVE (all previous Audit messages should be logged at Info)
- Warning: Indicates that there might be a systemic issue; potential
  predictive failure notice
- Error: An error has occurred and an administrator should research
  the event
- Critical: An error has occurred and the system might be unstable;
  immediately get administrator assistance

We can think of this from an operator perspective in the following ways
(Note: we are not specifying operator policy here, just trying to set
the tone for developers who aren't familiar with how these messages will
be interpreted):

- Critical: ZOMG! Cluster on FIRE! Call all pagers, wake up
  everyone. This is an unrecoverable error with a service that has led or
  probably will lead to service death or massive degradation.
- Error: Serious issue with the cloud; administrators should be notified
  immediately via email/pager. On-call people are expected to respond.
- Warning: Something is not right and should get looked into during the
  next work week. Administrators should be working through eliminating
  warnings as part of normal work.
- Info: Normal status messages showing measurable units of positive
  work passing through under normal functioning of the system. Should
  not be so verbose as to overwhelm real signal with noise. Should not
  be continuous "I'm alive!" messages.
- Debug: Developer logging level; only enable it if you are interested in
  reading through a ton of additional information about what is going on.

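As a rough illustration of how these levels might map onto code, the
sketch below uses Python's standard ``logging`` module. The
``spawn_instance`` function, its parameters, and the logger name are
hypothetical and exist only for this example; they are not taken from any
OpenStack project.

```python
import logging

LOG = logging.getLogger("example.compute")  # hypothetical logger name


def spawn_instance(instance_id, hypervisor_ok=True, disk_free_pct=50):
    """Pretend to spawn an instance, logging each event at a fitting level."""
    # Debug: developer detail about a flow that is only beginning.
    LOG.debug("Starting spawn of instance %s", instance_id)
    if not hypervisor_ok:
        # Error: an administrator should research this event.
        LOG.error("Instance %s failed to spawn: hypervisor unreachable",
                  instance_id)
        return False
    if disk_free_pct < 10:
        # Warning: possible systemic issue, a predictive failure notice.
        LOG.warning("Host has only %d%% disk space free", disk_free_pct)
    # Info: one completed unit of positive work.
    LOG.info("Instance %s spawned", instance_id)
    return True
```

With the logger set to INFO, a healthy call emits only the single
"Instance ... spawned" line; the "Starting" message appears only when
Debug is enabled.
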
Proposed Changes From Status Quo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Deprecate and remove AUDIT level

  Rationale: AUDIT is confusing, and people use it for entirely the
  wrong purposes. The origin of AUDIT was a NASA-specific requirement
  which is no longer really relevant to the current code.

Information that was previously being emitted at AUDIT should instead
be sent as notifications to a notification queue. *Note: Notification formats
and frequency are beyond the scope of this spec.*

Overall Logging Rules
---------------------

The following principles should apply to all messages.

Log messages at Info and above should be a "unit of work"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Info log level is defined as: "normal status messages showing
measurable units of positive work passing through under normal
functioning of the system."

A measurable unit of work should be describable by a short sentence
fragment, in the past tense, with a noun and a verb describing something
significant.

Examples::

  Instance spawned

  Instance destroyed

  Volume attached

  Image failed to copy

Words like "started", "finished", or any verb ending in "ing" are
flags for messages that are not units of work.

Debugging start / end messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

At the Debug log level it is often extremely important to flag the
beginning and ending of actions to track the progression of flows
(which might error out before the unit of work is completed).

This should be made clear by pairing each "starting" message with
some indication of completion for that starting point.

In a real OpenStack environment lots of things are happening in
parallel: there are multiple workers per service and multiple instances
of each service in the cloud, so the start and completion messages of
different flows are interleaved in the logs.

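One possible way to pair a "starting" message with its completion is a
small context manager, sketched below with Python's standard library. The
``debug_span`` helper and the short random identifier it attaches to both
messages are assumptions made for this example; the guidelines do not
prescribe a specific mechanism.

```python
import contextlib
import logging
import uuid

LOG = logging.getLogger("example.flow")  # hypothetical logger name


@contextlib.contextmanager
def debug_span(action):
    """Log matching Debug start and end messages around a block of work."""
    # A short random id lets a reader match start and end lines even when
    # many workers interleave their output in the same log.
    span_id = uuid.uuid4().hex[:8]
    LOG.debug("[%s] Starting %s", span_id, action)
    try:
        yield span_id
    except Exception:
        # The flow errored out before the unit of work completed.
        LOG.debug("[%s] Failed %s", span_id, action)
        raise
    else:
        LOG.debug("[%s] Finished %s", span_id, action)
```

A caller would wrap the work in ``with debug_span("image copy"):`` and
still emit a single Info message once the unit of work is done.
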
Examples of Good and Bad uses of Info
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Below are some examples of good and bad uses of Info. In the good
examples we can see the 'noun / verb' fragment for a unit of work
("successfully" is probably superfluous and could be removed).

In the bad examples we see trace-level thinking put into INFO and
above messages.

**Good**

::

  2014-01-26 15:36:10.597 28297 INFO nova.virt.libvirt.driver [-]
  [instance: b1b8e5c7-12f0-4092-84f6-297fe7642070] Instance spawned
  successfully.

  2014-01-26 15:36:14.307 28297 INFO nova.virt.libvirt.driver [-]
  [instance: b1b8e5c7-12f0-4092-84f6-297fe7642070] Instance destroyed
  successfully.

**Bad**

::

  2014-01-26 15:36:11.198 INFO nova.virt.libvirt.driver
  [req-ded67509-1e5d-4fb2-a0e2-92932bba9271
  FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689]
  [instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Creating image

  2014-01-26 15:36:11.525 INFO nova.virt.libvirt.driver
  [req-ded67509-1e5d-4fb2-a0e2-92932bba9271
  FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689]
  [instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Using config drive

  2014-01-26 15:36:12.326 AUDIT nova.compute.manager
  [req-714315e2-6318-4005-8f8f-05d7796ff45d FixedIPsTestXml-911165017
  FixedIPsTestXml-1315774890] [instance:
  b1b8e5c7-12f0-4092-84f6-297fe7642070] Terminating instance

  2014-01-26 15:36:12.570 INFO nova.virt.libvirt.driver
  [req-ded67509-1e5d-4fb2-a0e2-92932bba9271
  FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689]
  [instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Creating config
  drive at
  /opt/stack/data/nova/instances/fd027464-6e15-4f5d-8b1f-c389bdb8772a/disk.config

This is mostly an overshare issue. At Info these are intermediate stages
that don't really need to be communicated.

Messages shouldn't need a secret decoder ring
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Bad**

::

  2014-01-26 15:36:14.256 28297 INFO nova.compute.manager [-]
  Lifecycle event 1 on VM b1b8e5c7-12f0-4092-84f6-297fe7642070

General rule: when using constants or enums, ensure they are translated
back to user-readable strings prior to being sent to the user.

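A minimal sketch of this rule in Python follows. The ``LifecycleEvent``
enum and its numeric values are hypothetical stand-ins for whatever
constants a driver defines; the point is only that the name, not the
number, reaches the log.

```python
import enum
import logging

LOG = logging.getLogger("example.lifecycle")  # hypothetical logger name


class LifecycleEvent(enum.IntEnum):
    # Hypothetical values, chosen only for this illustration.
    STARTED = 0
    STOPPED = 1
    PAUSED = 2
    RESUMED = 3


def describe_event(code):
    """Translate a numeric event code into a human-readable word."""
    try:
        return LifecycleEvent(code).name.capitalize()
    except ValueError:
        # Keep the raw value visible if the code is not one we know.
        return "Unknown (%s)" % code


def log_lifecycle_event(code, vm_uuid):
    LOG.info("VM %s lifecycle event: %s", vm_uuid, describe_event(code))
```

Under these assumed values, the "Lifecycle event 1" message above would
read "lifecycle event: Stopped" instead.
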
Specific Event Types
--------------------

In addition to the above guidelines, very specific additional
requirements exist for certain event types.

WSGI requests
~~~~~~~~~~~~~

Should be:

- Logged at **INFO** level
- Logged exactly once per request
- Include enough information to know what the request was

The last point is notable, because some POST API requests don't
include enough information in the URL alone to determine what the
API did. For instance, Nova Server Actions (where the POST body includes a
method name).

Rationale: Operators should be able to easily see what API requests
their users are making, so they can understand the usage patterns
of their cloud.

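The three requirements above could be met by a thin piece of WSGI
middleware, sketched here against the plain WSGI calling convention. The
``RequestLogMiddleware`` class and the ``example.action`` environ key (which
some earlier layer is assumed to fill in from the POST body) are
hypothetical, and the sketch logs when the wrapped application returns, so
it does not account for applications that stream their response lazily.

```python
import logging

LOG = logging.getLogger("example.wsgi")  # hypothetical logger name


class RequestLogMiddleware(object):
    """Emit exactly one Info line per request handled by the wrapped app."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def _start_response(status, headers, exc_info=None):
            # Remember the status so it can be included in the log line.
            captured["status"] = status
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, _start_response)
        finally:
            # "example.action" is an assumed key naming the action carried
            # in the request body, which the URL alone would not reveal.
            LOG.info("%s %s action=%s status=%s",
                     environ.get("REQUEST_METHOD", "-"),
                     environ.get("PATH_INFO", "-"),
                     environ.get("example.action", "-"),
                     captured.get("status", "-"))
```

Logging in one place at the edge of the application, rather than in each
handler, is what keeps the count at one line per request.
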
Operator Deprecation Warnings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should be:

- Logged at **WARN** level
- Logged exactly once per service start (not on every request through
  the code)
- Include directions on what to do to migrate from the deprecated
  state

Rationale: Operators need to know that some aspect of their cloud
configuration is now deprecated and will require changes in the
future, and they need enough of a bread crumb trail to figure out how
to make those changes.

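A simple way to get "once per service start" behaviour is to remember, for
the lifetime of the process, which warnings have already been emitted. The
sketch below assumes a hypothetical ``warn_deprecated_option`` helper and
option names; it is an illustration, not an existing OpenStack API.

```python
import logging

LOG = logging.getLogger("example.deprecation")  # hypothetical logger name

# Names already warned about since this process started.
_already_warned = set()


def warn_deprecated_option(old_name, new_name):
    """Warn about a deprecated option at most once per process lifetime.

    Returns True if a warning was logged, False if it was suppressed.
    """
    if old_name in _already_warned:
        return False
    _already_warned.add(old_name)
    # Tell the operator what to do next, not only that something changed.
    LOG.warning("Option '%s' is deprecated; use '%s' instead",
                old_name, new_name)
    return True
```

Because the set lives in process memory, the warning reappears after a
restart, which matches the "once per service start" requirement.
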
REST API Deprecation Warnings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should be:

- **Not** logged any higher than DEBUG (these are not operator-facing
  messages)
- Logged no more than once per REST API usage / tenant. Definitely
  not on *every* REST API call.

Rationale: The users of the REST API don't have access to the system
logs. Therefore logging at a WARNING level is telling the wrong people
about the fact that they are using a deprecated API.

Deprecation of a user-facing API should be communicated via user-facing
mechanisms, namely API change notes associated with new API versions.

Stacktraces in Logs
~~~~~~~~~~~~~~~~~~~

Should be:

- Reserved for **exceptional** events: unforeseeable circumstances that
  are not yet recoverable by the system.
- Logged at ERROR level
- Considered high priority bugs to be addressed by the development
  team.

Rationale: The current behavior of OpenStack is extremely stack trace
happy. Many existing stack traces in the logs are considered
*normal*. This dramatically increases the time to find the root cause
of real issues in OpenStack.

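In Python terms, this suggests catching foreseeable errors and logging them
without a traceback, while letting ``Logger.exception`` (which logs at
ERROR and attaches the traceback) handle only the unforeseen case. The
``InstanceNotFound`` exception and ``run_action`` helper below are
hypothetical, and the level chosen for the expected case is a judgment call
that these guidelines do not fix.

```python
import logging

LOG = logging.getLogger("example.errors")  # hypothetical logger name


class InstanceNotFound(Exception):
    """Hypothetical example of an error the code knows how to handle."""


def run_action(action, instance_id):
    """Call action(instance_id), logging a traceback only when surprised."""
    try:
        return action(instance_id)
    except InstanceNotFound:
        # Foreseeable and recoverable: a one-line message, no stack trace.
        LOG.debug("Instance %s not found; nothing to do", instance_id)
        return None
    except Exception:
        # Unforeseen: ERROR level with the stack trace, then re-raise so
        # the failure is not silently swallowed.
        LOG.exception("Unexpected failure acting on instance %s",
                      instance_id)
        raise
```
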
Logging by non-OpenStack Components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenStack uses a ton of libraries, which have their own definitions of
logging. Because those definitions differ wildly, a lot of extraneous
information ends up in normal logs.

As such, all 3rd party libraries should have their logging levels
adjusted so only real errors are logged.

Currently proposed settings for 3rd party libraries:

- amqp=WARN
- boto=WARN
- qpid=WARN
- sqlalchemy=WARN
- suds=INFO
- iso8601=WARN
- requests.packages.urllib3.connectionpool=WARN
- urllib3.connectionpool=WARN

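With Python's standard ``logging`` module, per-library levels like these
can be applied by looking up each library's logger by name. The
``apply_library_levels`` helper below is a hypothetical sketch; how a given
OpenStack service actually wires these defaults into its configuration is
not covered here.

```python
import logging

# Proposed defaults from the list above; WARN is spelled logging.WARNING.
DEFAULT_LIBRARY_LEVELS = {
    "amqp": logging.WARNING,
    "boto": logging.WARNING,
    "qpid": logging.WARNING,
    "sqlalchemy": logging.WARNING,
    "suds": logging.INFO,
    "iso8601": logging.WARNING,
    "requests.packages.urllib3.connectionpool": logging.WARNING,
    "urllib3.connectionpool": logging.WARNING,
}


def apply_library_levels(levels=None):
    """Set the threshold of each named library logger."""
    if levels is None:
        levels = DEFAULT_LIBRARY_LEVELS
    for name, level in levels.items():
        # Loggers are created on first lookup, so this works even if the
        # library has not been imported yet.
        logging.getLogger(name).setLevel(level)
```
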
Alternatives
------------

Continue to have terribly confusing logs

Data model impact
-----------------

NA

REST API impact
---------------

NA

Security impact
---------------

NA

Notifications impact
--------------------

NA

Other end user impact
---------------------

NA

Performance Impact
------------------

NA

Other deployer impact
---------------------

Should provide a much more standard way to determine what's going on
in the system.

Developer impact
----------------

Developers will need to be cognizant of these guidelines in creating
new code or reviewing code.

Implementation
==============

Assignee(s)
-----------

The assignee is responsible for moving these guidelines through the
review process to something that we all agree on. The expectation is that
these become review criteria that we can reference and that are
implemented by a large number of people. Once approved, the assignee will
also drive collecting volunteers to help fix logging in multiple projects.

Primary assignee:
  Sean Dague <sean@dague.net>

Work Items
----------

This section highlights things we need to decide that aren't
settled yet.

Proposed changes with general consensus

- Drop the AUDIT log level and move all AUDIT messages to either an INFO
  log message or a ``notification``.
- Begin adjusting log levels within projects to match the severity
  guidelines.

Dependencies
============

NA

Testing
=======

See tests provided by
https://blueprints.launchpad.net/nova/+spec/clean-logs

Documentation Impact
====================

Once agreed upon this should form a more permanent document on logging
specifications.

References
==========

- Security Log Guidelines -
  https://wiki.openstack.org/wiki/Security/Guidelines/logging_guidelines
- Wiki page for basic logging standards proposal developed early in
  Icehouse - https://wiki.openstack.org/wiki/LoggingStandards