Merge "Log Guidelines"
This commit is contained in:
commit
c3be753672
|
@ -0,0 +1,384 @@
|
|||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
==========================================
|
||||
Logging Guidelines
|
||||
==========================================
|
||||
|
||||
https://blueprints.launchpad.net/nova/+spec/log-guidelines
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
The current state of logging both within and between OpenStack
|
||||
components is inconsistent to the point of being somewhat harmful by
|
||||
obscuring the current state, function, and real cause of errors in an
|
||||
OpenStack cloud. A consistent, unified logging format will better
|
||||
enable cloud administrators to monitor and maintain their
|
||||
environments.
|
||||
|
||||
Before we can address this in OpenStack, we first need to come up with
|
||||
a set of guidelines that we can get broad agreement on. This is
|
||||
expected to happen in waves, and this is the first iteration to gather
|
||||
agreement on.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
Definition of Log Levels
|
||||
------------------------
|
||||
|
||||
http://stackoverflow.com/a/2031209
|
||||
This is a nice writeup about when to use each log level. Here is a
|
||||
brief description:
|
||||
|
||||
- Debug: Shows everything and is likely not suitable for normal
|
||||
production operation due to the sheer size of logs generated
|
||||
- Info: Usually indicates successful service start/stop, versions and
|
||||
such non-error related data. This should include largely positive
|
||||
units of work that are accomplished (such as starting a compute,
|
||||
creating a user, deleting a volume, etc.)
|
||||
- Audit: REMOVE - (all previous Audit messages should be put as INFO)
|
||||
- Warning: Indicates that there might be a systemic issue; potential
|
||||
predictive failure notice
|
||||
- Error: An error has occurred and an administrator should research
|
||||
the event
|
||||
- Critical: An error has occurred and the system might be unstable;
|
||||
immediately get administrator assistance
|
||||
|
||||
We can think of this from an operator perspective the following ways
|
||||
(Note: we are not specifying operator policy here, just trying to set
|
||||
tone for developers that aren't familiar with how these messages will
|
||||
be interpreted):
|
||||
|
||||
- Critical : ZOMG! Cluster on FIRE! Call all pagers, wake up
|
||||
everyone. This is an unrecoverable error with a service that has or
|
||||
probably will lead to service death or massive degredation.
|
||||
- Error: Serious issue with cloud, administrator should be notified
|
||||
immediately via email/pager. On call people expected to respond.
|
||||
- Warning: Something is not right, should get looked into during the
|
||||
next work week. Administrators should be working through eliminating
|
||||
warnings as part of normal work.
|
||||
- Info: normal status messages showing measureable units of positive
|
||||
work passing through under normal functioning of the system. Should
|
||||
not be so verbose as to overwhelm real signal with noise. Should not
|
||||
be continuous "I'm alive!" messages.
|
||||
- Debug: developer logging level, only enable if you are interested in
|
||||
reading through a ton of additional information about what is going on.
|
||||
|
||||
Proposed Changes From Status Quo
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
- Deprecate and remove AUDIT level
|
||||
|
||||
Rationale, AUDIT is confusing, and people use it for entirely the
|
||||
wrong purposes. The origin of AUDIT was a NASA specific requirement
|
||||
which is not longer really relevant to the current code.
|
||||
|
||||
Information that was previously being emitted at AUDIT should instead
|
||||
be sent as notifications to a notification queue. *Note: Notification formats
|
||||
and frequency are beyond the scope of this spec.*
|
||||
|
||||
Overall Logging Rules
|
||||
---------------------
|
||||
The following principles should apply to all messages
|
||||
|
||||
Log messages at Info and above should be a "unit of work"
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The Info log level is defined as: "normal status messages showing
|
||||
measureable units of positive work passing through under normal
|
||||
functioning of the system."
|
||||
|
||||
A measurable unit of work should be describable by a short sentence
|
||||
fragment, in the past tense with a noun and a verb of something
|
||||
significant.
|
||||
|
||||
Examples::
|
||||
|
||||
Instance spawned
|
||||
|
||||
Instance destroyed
|
||||
|
||||
Volume attached
|
||||
|
||||
Image failed to copy
|
||||
|
||||
Words like "started", "finished", or any verb ending in "ing" are
|
||||
flags for non unit of work messages.
|
||||
|
||||
Debugging start / end messages
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
At the Debug log level it is often extremely important to flag the
|
||||
beginning and ending of actions to track the progression of flows
|
||||
(which might error out before the unit of work is completed).
|
||||
|
||||
This should be made clear by there being a "starting" message with
|
||||
some indication of completion for that starting point.
|
||||
|
||||
In a real OpenStack environment lots of things are happening in
|
||||
parallel. There are multiple workers per services, multiple instances
|
||||
of services in the cloud.
|
||||
|
||||
Examples of Good and Bad uses of Info
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Below are some examples of good and bad users of info. In the Good
|
||||
examples we can see the 'noun / verb' fragment for a unit of work
|
||||
(successfully is probably superfluous and could be removed).
|
||||
|
||||
In the bad examples we see trace level thinking put into INFO and
|
||||
above messages.
|
||||
|
||||
**Good**
|
||||
|
||||
::
|
||||
2014-01-26 15:36:10.597 28297 INFO nova.virt.libvirt.driver [-]
|
||||
[instance: b1b8e5c7-12f0-4092-84f6-297fe7642070] Instance spawned
|
||||
successfully.
|
||||
|
||||
2014-01-26 15:36:14.307 28297 INFO nova.virt.libvirt.driver [-]
|
||||
[instance: b1b8e5c7-12f0-4092-84f6-297fe7642070] Instance destroyed
|
||||
successfully.
|
||||
|
||||
**Bad**
|
||||
|
||||
::
|
||||
2014-01-26 15:36:11.198 INFO nova.virt.libvirt.driver
|
||||
[req-ded67509-1e5d-4fb2-a0e2-92932bba9271
|
||||
FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689]
|
||||
[instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Creating image
|
||||
|
||||
2014-01-26 15:36:11.525 INFO nova.virt.libvirt.driver
|
||||
[req-ded67509-1e5d-4fb2-a0e2-92932bba9271
|
||||
FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689]
|
||||
[instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Using config drive
|
||||
|
||||
2014-01-26 15:36:12.326 AUDIT nova.compute.manager
|
||||
[req-714315e2-6318-4005-8f8f-05d7796ff45d FixedIPsTestXml-911165017
|
||||
FixedIPsTestXml-1315774890] [instance:
|
||||
b1b8e5c7-12f0-4092-84f6-297fe7642070] Terminating instance
|
||||
|
||||
2014-01-26 15:36:12.570 INFO nova.virt.libvirt.driver
|
||||
[req-ded67509-1e5d-4fb2-a0e2-92932bba9271
|
||||
FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689]
|
||||
[instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Creating config
|
||||
drive at
|
||||
/opt/stack/data/nova/instances/fd027464-6e15-4f5d-8b1f
|
||||
-c389bdb8772a/disk.config
|
||||
|
||||
This is mostly an overshare issue. At Info these are stages that don't
|
||||
really need to be fully communicated.
|
||||
|
||||
Messages shouldn't need a secret decoder ring
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Bad**
|
||||
|
||||
::
|
||||
2014-01-26 15:36:14.256 28297 INFO nova.compute.manager [-]
|
||||
Lifecycle event 1 on VM b1b8e5c7-12f0-4092-84f6-297fe7642070
|
||||
|
||||
General rule, when using constants or enums ensure they are translated
|
||||
back to user strings prior to being sent to the user.
|
||||
|
||||
Specific Event Types
|
||||
--------------------
|
||||
|
||||
In addition to the above guidelines very specific additional
|
||||
requirements exist.
|
||||
|
||||
WSGI requests
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
Should be:
|
||||
|
||||
- Logged at **INFO** level
|
||||
- Logged exactly once per request
|
||||
- Include enough information to know what the request was
|
||||
|
||||
The last point is notable, because some POST API requests don't
|
||||
include enough information in the URL alone to determine what the
|
||||
API did. For instance, Nova Server Actions (where POST includes a
|
||||
method name).
|
||||
|
||||
Rationale: Operators should be able to easily see what API requests
|
||||
their users are making in their cloud to understand the usage patterns
|
||||
of their users with their cloud.
|
||||
|
||||
Operator Deprecation Warnings
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Should be:
|
||||
|
||||
- Logged at **WARN** level
|
||||
- Logged exactly once per service start (not on every request through
|
||||
code)
|
||||
- Include directions on what to do to migrate from the deprecated
|
||||
state
|
||||
|
||||
Rationale: Operators need to know that some aspect of their cloud
|
||||
configuration is now deprecated, and will require changes in the
|
||||
future. And they need enough of a bread crumb trail to figure out how
|
||||
to do that.
|
||||
|
||||
REST API Deprecation Warnings
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Should be:
|
||||
|
||||
- **Not** logged any higher than DEBUG (these are not operator facing
|
||||
messages)
|
||||
- Logged no more than once per REST API usage / tenant. Definitely
|
||||
not on *every* REST API call.
|
||||
|
||||
Rationale: The users of the REST API don't have access to the system
|
||||
logs. Therefore logging at a WARNING level is telling the wrong people
|
||||
about the fact that they are using a deprecated API.
|
||||
|
||||
Deprecation of User facing API should be communicated via User facing
|
||||
mechanisms, being API change notes associated with new API versions.
|
||||
|
||||
Stacktraces in Logs
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Should be:
|
||||
|
||||
- **exceptional** events, for unforeseeable circumstance that is not
|
||||
yet recoverable by the system.
|
||||
- Logged at ERROR level
|
||||
- Considered high priority bugs to be addressed by the development
|
||||
team.
|
||||
|
||||
Rationale: The current behavior of OpenStack is extremely stack trace
|
||||
happy. Many existing stack traces in the logs are considered
|
||||
*normal*. This dramatically increases the time to find the root cause
|
||||
of real issues in OpenStack.
|
||||
|
||||
|
||||
Logging by non-OpenStack Components
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
OpenStack uses a ton of libraries, which have their own definitions of
|
||||
logging. This causes a lot of extraneous information in normal logs by
|
||||
wildly different definitions of those libraries.
|
||||
|
||||
As such, all 3rd party libraries should have their logging levels
|
||||
adjusted so only real errors are logged.
|
||||
|
||||
Currently proposed settings for 3rd party libraries:
|
||||
|
||||
- amqp=WARN
|
||||
- boto=WARN
|
||||
- qpid=WARN
|
||||
- sqlalchemy=WARN
|
||||
- suds=INFO
|
||||
- iso8601=WARN
|
||||
- requests.packages.urllib3.connectionpool=WARN
|
||||
- urllib3.connectionpool=WARN
|
||||
|
||||
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
Continue to have terribly confusing logs
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
NA
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
NA
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
NA
|
||||
|
||||
Notifications impact
|
||||
--------------------
|
||||
|
||||
NA
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
NA
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
NA
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
Should provide a much more standard way to determine what's going on
|
||||
in the system.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Developers will need to be cognizant of these guidelines in creating
|
||||
new code or reviewing code.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Assignee is for moving these guidelines through the review process to
|
||||
something that we all agree on. The expectation is that these become
|
||||
review criteria that we can reference and are implemented by a large
|
||||
number of people. Once approved, will also drive collecting volunteers
|
||||
to help fix in multiple projects.
|
||||
|
||||
Primary assignee:
|
||||
Sean Dague <sean@dague.net>
|
||||
|
||||
Work Items
|
||||
----------
|
||||
Using this section to highlight things we need to decide that aren't
|
||||
settled as of yet.
|
||||
|
||||
Proposed changes with general consensus
|
||||
|
||||
- Drop AUDIT log level, move all AUDIT message to either an INFO log
|
||||
message or a ``notification``.
|
||||
- Begin adjusting log levels within projects to match the severity
|
||||
guidelines.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
NA
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
See tests provided by
|
||||
https://blueprints.launchpad.net/nova/+spec/clean-logs
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Once agreed upon this should form a more permanent document on logging
|
||||
specifications.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
- Security Log Guidelines -
|
||||
https://wiki.openstack.org/wiki/Security/Guidelines/logging_guidelines
|
||||
- Wiki page for basic logging standards proposal developed early in
|
||||
Icehouse - https://wiki.openstack.org/wiki/LoggingStandards
|
Loading…
Reference in New Issue