diff --git a/specs/log-guidelines.rst b/specs/log-guidelines.rst new file mode 100644 index 0000000..291b760 --- /dev/null +++ b/specs/log-guidelines.rst @@ -0,0 +1,384 @@ +.. + This work is licensed under a Creative Commons Attribution 3.0 Unported + License. + + http://creativecommons.org/licenses/by/3.0/legalcode + +========================================== +Logging Guidelines +========================================== + +https://blueprints.launchpad.net/nova/+spec/log-guidelines + +Problem description +=================== + +The current state of logging both within and between OpenStack +components is inconsistent to the point of being somewhat harmful by +obscuring the current state, function, and real cause of errors in an +OpenStack cloud. A consistent, unified logging format will better +enable cloud administrators to monitor and maintain their +environments. + +Before we can address this in OpenStack, we first need to come up with +a set of guidelines that we can get broad agreement on. This is +expected to happen in waves, and this is the first iteration to gather +agreement on. + +Proposed change +=============== + +Definition of Log Levels +------------------------ + +http://stackoverflow.com/a/2031209 +This is a nice writeup about when to use each log level. Here is a +brief description: + +- Debug: Shows everything and is likely not suitable for normal + production operation due to the sheer size of logs generated +- Info: Usually indicates successful service start/stop, versions and + such non-error related data. This should include largely positive + units of work that are accomplished (such as starting a compute, + creating a user, deleting a volume, etc.) +- Audit: REMOVE - (all previous Audit messages should be put as INFO) +- Warning: Indicates that there might be a systemic issue; potential + predictive failure notice +- Error: An error has occurred and an administrator should research + the event +- Critical: An error has occurred and the system might be unstable; + immediately get administrator assistance + +We can think of this from an operator perspective the following ways +(Note: we are not specifying operator policy here, just trying to set +tone for developers that aren't familiar with how these messages will +be interpreted): + +- Critical : ZOMG! Cluster on FIRE! Call all pagers, wake up + everyone. This is an unrecoverable error with a service that has or + probably will lead to service death or massive degredation. +- Error: Serious issue with cloud, administrator should be notified + immediately via email/pager. On call people expected to respond. +- Warning: Something is not right, should get looked into during the + next work week. Administrators should be working through eliminating + warnings as part of normal work. +- Info: normal status messages showing measureable units of positive + work passing through under normal functioning of the system. Should + not be so verbose as to overwhelm real signal with noise. Should not + be continuous "I'm alive!" messages. +- Debug: developer logging level, only enable if you are interested in + reading through a ton of additional information about what is going on. + +Proposed Changes From Status Quo +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +- Deprecate and remove AUDIT level + +Rationale, AUDIT is confusing, and people use it for entirely the +wrong purposes. The origin of AUDIT was a NASA specific requirement +which is not longer really relevant to the current code. + +Information that was previously being emitted at AUDIT should instead +be sent as notifications to a notification queue. *Note: Notification formats +and frequency are beyond the scope of this spec.* + +Overall Logging Rules +--------------------- +The following principles should apply to all messages + +Log messages at Info and above should be a "unit of work" +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The Info log level is defined as: "normal status messages showing +measureable units of positive work passing through under normal +functioning of the system." + +A measurable unit of work should be describable by a short sentence +fragment, in the past tense with a noun and a verb of something +significant. + +Examples:: + + Instance spawned + + Instance destroyed + + Volume attached + + Image failed to copy + +Words like "started", "finished", or any verb ending in "ing" are +flags for non unit of work messages. + +Debugging start / end messages +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +At the Debug log level it is often extremely important to flag the +beginning and ending of actions to track the progression of flows +(which might error out before the unit of work is completed). + +This should be made clear by there being a "starting" message with +some indication of completion for that starting point. + +In a real OpenStack environment lots of things are happening in +parallel. There are multiple workers per services, multiple instances +of services in the cloud. + +Examples of Good and Bad uses of Info +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Below are some examples of good and bad users of info. In the Good +examples we can see the 'noun / verb' fragment for a unit of work +(successfully is probably superfluous and could be removed). + +In the bad examples we see trace level thinking put into INFO and +above messages. + +**Good** + +:: + 2014-01-26 15:36:10.597 28297 INFO nova.virt.libvirt.driver [-] + [instance: b1b8e5c7-12f0-4092-84f6-297fe7642070] Instance spawned + successfully. + + 2014-01-26 15:36:14.307 28297 INFO nova.virt.libvirt.driver [-] + [instance: b1b8e5c7-12f0-4092-84f6-297fe7642070] Instance destroyed + successfully. + +**Bad** + +:: + 2014-01-26 15:36:11.198 INFO nova.virt.libvirt.driver + [req-ded67509-1e5d-4fb2-a0e2-92932bba9271 + FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689] + [instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Creating image + + 2014-01-26 15:36:11.525 INFO nova.virt.libvirt.driver + [req-ded67509-1e5d-4fb2-a0e2-92932bba9271 + FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689] + [instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Using config drive + + 2014-01-26 15:36:12.326 AUDIT nova.compute.manager + [req-714315e2-6318-4005-8f8f-05d7796ff45d FixedIPsTestXml-911165017 + FixedIPsTestXml-1315774890] [instance: + b1b8e5c7-12f0-4092-84f6-297fe7642070] Terminating instance + + 2014-01-26 15:36:12.570 INFO nova.virt.libvirt.driver + [req-ded67509-1e5d-4fb2-a0e2-92932bba9271 + FixedIPsNegativeTestXml-1426989627 FixedIPsNegativeTestXml-38506689] + [instance: fd027464-6e15-4f5d-8b1f-c389bdb8772a] Creating config + drive at + /opt/stack/data/nova/instances/fd027464-6e15-4f5d-8b1f + -c389bdb8772a/disk.config + +This is mostly an overshare issue. At Info these are stages that don't +really need to be fully communicated. + +Messages shouldn't need a secret decoder ring +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +**Bad** + +:: + 2014-01-26 15:36:14.256 28297 INFO nova.compute.manager [-] + Lifecycle event 1 on VM b1b8e5c7-12f0-4092-84f6-297fe7642070 + +General rule, when using constants or enums ensure they are translated +back to user strings prior to being sent to the user. + +Specific Event Types +-------------------- + +In addition to the above guidelines very specific additional +requirements exist. + +WSGI requests +~~~~~~~~~~~~~ + +Should be: + +- Logged at **INFO** level +- Logged exactly once per request +- Include enough information to know what the request was + +The last point is notable, because some POST API requests don't +include enough information in the URL alone to determine what the +API did. For instance, Nova Server Actions (where POST includes a +method name). + +Rationale: Operators should be able to easily see what API requests +their users are making in their cloud to understand the usage patterns +of their users with their cloud. + +Operator Deprecation Warnings +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Should be: + +- Logged at **WARN** level +- Logged exactly once per service start (not on every request through + code) +- Include directions on what to do to migrate from the deprecated + state + +Rationale: Operators need to know that some aspect of their cloud +configuration is now deprecated, and will require changes in the +future. And they need enough of a bread crumb trail to figure out how +to do that. + +REST API Deprecation Warnings +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Should be: + +- **Not** logged any higher than DEBUG (these are not operator facing + messages) +- Logged no more than once per REST API usage / tenant. Definitely + not on *every* REST API call. + +Rationale: The users of the REST API don't have access to the system +logs. Therefore logging at a WARNING level is telling the wrong people +about the fact that they are using a deprecated API. + +Deprecation of User facing API should be communicated via User facing +mechanisms, being API change notes associated with new API versions. + +Stacktraces in Logs +~~~~~~~~~~~~~~~~~~~ + +Should be: + +- **exceptional** events, for unforeseeable circumstance that is not + yet recoverable by the system. +- Logged at ERROR level +- Considered high priority bugs to be addressed by the development + team. + +Rationale: The current behavior of OpenStack is extremely stack trace +happy. Many existing stack traces in the logs are considered +*normal*. This dramatically increases the time to find the root cause +of real issues in OpenStack. + + +Logging by non-OpenStack Components +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +OpenStack uses a ton of libraries, which have their own definitions of +logging. This causes a lot of extraneous information in normal logs by +wildly different definitions of those libraries. + +As such, all 3rd party libraries should have their logging levels +adjusted so only real errors are logged. + +Currently proposed settings for 3rd party libraries: + +- amqp=WARN +- boto=WARN +- qpid=WARN +- sqlalchemy=WARN +- suds=INFO +- iso8601=WARN +- requests.packages.urllib3.connectionpool=WARN +- urllib3.connectionpool=WARN + + + +Alternatives +------------ + +Continue to have terribly confusing logs + +Data model impact +----------------- + +NA + +REST API impact +--------------- + +NA + +Security impact +--------------- + +NA + +Notifications impact +-------------------- + +NA + +Other end user impact +--------------------- + +NA + +Performance Impact +------------------ + +NA + +Other deployer impact +--------------------- + +Should provide a much more standard way to determine what's going on +in the system. + +Developer impact +---------------- + +Developers will need to be cognizant of these guidelines in creating +new code or reviewing code. + +Implementation +============== + +Assignee(s) +----------- + +Assignee is for moving these guidelines through the review process to +something that we all agree on. The expectation is that these become +review criteria that we can reference and are implemented by a large +number of people. Once approved, will also drive collecting volunteers +to help fix in multiple projects. + +Primary assignee: + Sean Dague + +Work Items +---------- +Using this section to highlight things we need to decide that aren't +settled as of yet. + +Proposed changes with general consensus + +- Drop AUDIT log level, move all AUDIT message to either an INFO log + message or a ``notification``. +- Begin adjusting log levels within projects to match the severity + guidelines. + + +Dependencies +============ + +NA + +Testing +======= + +See tests provided by +https://blueprints.launchpad.net/nova/+spec/clean-logs + +Documentation Impact +==================== + +Once agreed upon this should form a more permanent document on logging +specifications. + +References +========== + +- Security Log Guidelines - + https://wiki.openstack.org/wiki/Security/Guidelines/logging_guidelines +- Wiki page for basic logging standards proposal developed early in + Icehouse - https://wiki.openstack.org/wiki/LoggingStandards