Classify tempest-devstack failures using ElasticSearch

Sean Dague 3c6281a7d4 this query is too broad to be useful And we have better queries now that we are indexing horizon_error.log Change-Id: I2eac509325186236a4f3a7a6ace45eacce563eed Related-Bug: #1265057		2014-06-23 11:18:29 -04:00
doc/source	Document adding bug signatures to e-r.	2013-12-12 12:34:54 -08:00
elastic_recheck	Make the #1235437 query time limit change resitent	2014-06-18 13:01:20 +02:00
queries	this query is too broad to be useful	2014-06-23 11:18:29 -04:00
tools	add a tool for determining unaccounted for rechecks	2014-06-10 18:24:09 -04:00
web	Add link to elastic-recheck docs to index	2014-06-06 10:38:32 -07:00
.coveragerc	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
.gitignore	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
.gitreview	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
.testr.conf	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
CONTRIBUTING.rst	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
LICENSE	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
MANIFEST.in	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
README.rst	Update readme with more guidelines for writing queries	2014-06-19 06:03:04 -07:00
babel.cfg	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
elasticRecheck.conf.sample	move queries.yaml into a queries subdir	2013-12-02 11:43:00 -05:00
recheckwatchbot.yaml	Add multi-project irc support to the bot	2014-01-24 12:21:47 -05:00
requirements.txt	python-dateutil requires six, be explicit about it	2014-03-02 06:48:41 -05:00
setup.cfg	add uncategorized failure generation code	2014-01-17 09:35:40 -05:00
setup.py	Apply Cookiecutter to the repo.	2013-09-23 15:27:39 -07:00
test-requirements.txt	Add fingerprint for bug 1274056	2014-01-29 15:48:29 +04:00
tox.ini	Add tox target for building docs	2014-06-18 08:06:05 -07:00

README.rst

elastic-recheck

"Use ElasticSearch to classify OpenStack gate failures"

Open Source Software: Apache license

Idea

Identifying the specific bug that is causing a transient error in the gate is very hard. Just identifying which tempest test failed is not enough because a single bug can potentially cause multiple tempest tests to fail. If we can find a fingerprint for a specific bug using logs, then we can use ElasticSearch to automatically detect any occurrences of the bug.

Using these fingerprints elastic-recheck can:

Search ElasticSearch for all occurrences of a bug.
Identify bug trends, such as when the bug started, whether it has been fixed, and whether it is getting worse.
Classify bug failures in real time and report back to gerrit if we find a match, so a patch author knows why the test failed.

queries/

All queries are stored in separate YAML files in a queries directory at the top of the elastic-recheck code base. Each file is named ######.yaml, where ###### is the Launchpad bug number, and must contain a query key whose value is the ElasticSearch query text.
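As an illustration of the file layout described above, here is a minimal Python sketch that renders the path and contents of a new query file. The helper name `render_query_file`, the bug number 1234567, and the example query text are hypothetical and not part of elastic-recheck; the sketch also assumes the query is written as a YAML folded block scalar.

```python
from pathlib import PurePosixPath


def render_query_file(bug_number, query):
    """Return (path, text) for a query file named after a Launchpad bug.

    Hypothetical helper for illustration; not part of elastic-recheck.
    """
    if not isinstance(bug_number, int) or bug_number <= 0:
        raise ValueError("bug_number must be a positive integer")
    if not query.strip():
        raise ValueError("query must not be empty")

    # The file name is the Launchpad bug number plus the .yaml suffix.
    path = PurePosixPath("queries") / f"{bug_number}.yaml"

    # Assumption: a folded block scalar ('>') is used so that quotes and
    # colons inside the Lucene query need no YAML escaping.
    text = f"query: >\n  {query.strip()}\n"
    return str(path), text


if __name__ == "__main__":
    # Hypothetical bug number and query text.
    path, text = render_query_file(
        1234567,
        'message:"example failure text" AND tags:"screen-n-cpu.txt"',
    )
    print(path)
    print(text, end="")
```

The only required key is query; the sketch deliberately emits nothing else.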

Guidelines for good queries:

Queries should get as close as possible to fingerprinting the root cause. A filename query is typically better than a console one, as that's matching a deep failure versus a surface symptom.
Queries should not return any hits for successful jobs; such hits are a sign that the query isn't specific enough. As a rule of thumb, if more than 10% of the hits come from successful jobs, the query probably isn't good enough.
If it's impossible to build a query to target a bug, consider patching the upstream program to be explicit when it fails in a particular way.
Use the 'tags' field rather than the 'filename' field for filtering. This is primarily because of grenade jobs where the same log file shows up in the 'old' and 'new' side of the grenade job. For example, tags:"screen-n-cpu.txt" will query in logs/old/screen-n-cpu.txt and logs/new/screen-n-cpu.txt. The tags:"console" filter is also used to query in console.html as well as tempest and devstack logs.
Avoid the use of wildcards in queries, since they can put an undue burden on the query engine. A common case where wildcards would be useful is querying against a specific set of build_name fields, e.g. gate-nova-python26 and gate-nova-python27. Rather than use build_name:gate-nova-python*, list the jobs with an OR, e.g.:
```
(build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
```
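Two of the guidelines above lend themselves to a small Python sketch: building an explicit OR clause instead of a wildcard, and applying the 10% rule of thumb to a set of hits. Both helper names are hypothetical, and the sketch assumes each hit is a dict carrying a `build_status` field whose value is `SUCCESS` for successful jobs; adjust that to whatever the indexed documents actually contain.

```python
def or_filter(field, values):
    """Build an explicit OR clause over exact values (no wildcards)."""
    if not values:
        raise ValueError("at least one value is required")
    clauses = [f'{field}:"{value}"' for value in values]
    return "(" + " OR ".join(clauses) + ")"


def success_hit_ratio(hits):
    """Return the fraction of hits that come from successful jobs.

    Assumption: each hit is a dict with a 'build_status' key that is
    'SUCCESS' for successful jobs.
    """
    if not hits:
        return 0.0
    successes = sum(1 for hit in hits if hit.get("build_status") == "SUCCESS")
    return successes / len(hits)


if __name__ == "__main__":
    print(or_filter("build_name", ["gate-nova-python26", "gate-nova-python27"]))

    # Made-up hits, only to exercise the rule of thumb.
    sample_hits = [
        {"build_status": "FAILURE"},
        {"build_status": "FAILURE"},
        {"build_status": "FAILURE"},
        {"build_status": "SUCCESS"},
    ]
    ratio = success_hit_ratio(sample_hits)
    if ratio > 0.10:
        print("query is probably not specific enough")
```

Listing the job names explicitly keeps the query cheap for the search cluster, at the cost of updating the list when jobs are added or renamed.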

To support adding queries rapidly, it's considered socially acceptable to +A changes that only add one new bug query, and even for core reviewers to self-approve those changes.

Adding Bug Signatures

Most transient bugs seen in the gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.

1. Given a transient bug that is seen during the gate, go through the logs (logs.openstack.org) and try to find a log message that is associated with the failure. The closer to the root cause, the better.

   Note that queries can only be written against INFO level and higher log messages. This is by design, so as not to overwhelm the search cluster.
2. Go to logstash.openstack.org and create an ElasticSearch query to find the log message from step 1. To see the possible fields to search on, click on an entry. Lucene query syntax is documented at http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
3. Add a comment to the bug with the query you identified and a link to the logstash URL for that query search.
4. Add the query to elastic-recheck/queries/BUGNUMBER.yaml and push the patch up for review. https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
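The Lucene query string produced in the steps above can also be wrapped in an ElasticSearch request body. The following Python sketch shows one way to do that with a `query_string` query. The function name `build_search_body` and the example query are hypothetical, and because the source does not describe the search endpoint behind logstash.openstack.org, the sketch only builds the body and makes no network call.

```python
import json


def build_search_body(query, size=10):
    """Wrap a Lucene query string in an ElasticSearch search request body.

    Hypothetical helper for illustration; not part of elastic-recheck.
    """
    if not query.strip():
        raise ValueError("query must not be empty")
    if size < 0:
        raise ValueError("size must not be negative")
    return {
        # query_string hands the text to the Lucene query parser.
        "query": {"query_string": {"query": query}},
        # Maximum number of hits to return.
        "size": size,
    }


if __name__ == "__main__":
    # Hypothetical query text.
    body = build_search_body(
        'message:"example failure text" AND tags:"screen-n-cpu.txt"'
    )
    print(json.dumps(body, indent=2))
```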

Future Work

Move config files into a separate directory
Make unit tests robust
Add debug mode flag
Expand gating testing
Clean up and document code better
Add ability to check if any resolved bugs return
Move away from polling ElasticSearch to discover if it's ready or not
Add nightly job to propose a patch to remove bug queries that return no hits (the bug hasn't been seen in 2 weeks and must be closed)