Classify tempest-devstack failures using ElasticSearch
Sean Dague 07aaba10ed added ER query for swift not starting in grenade
some times swift doesn't successfully start after a grenade
upgrade, seen 3 times this week.

Change-Id: Ib08d3a6c8d26fed83db12bd0d06896d38eb94f59
2014-01-11 20:53:06 -05:00
doc/source Document adding bug signatures to e-r. 2013-12-12 12:34:54 -08:00
elastic_recheck Merge "add per job percentages for responsibility of fails" 2013-12-19 12:35:43 +00:00
queries added ER query for swift not starting in grenade 2014-01-11 20:53:06 -05:00
web Add graph for gate hit count 2013-12-16 11:04:59 -08:00
.coveragerc Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.gitignore Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.gitreview Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.testr.conf Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
babel.cfg Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
CONTRIBUTING.rst Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
elasticRecheck.conf.sample move queries.yaml into a queries subdir 2013-12-02 11:43:00 -05:00
LICENSE Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
MANIFEST.in Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
README.rst Add some documentation on wildcard limitations in queries 2014-01-07 11:57:38 -08:00
recheckwatchbot.yaml Make bot.py behave like a daemon 2013-09-18 17:45:12 -04:00
requirements.txt Make pid file configurable 2013-09-30 10:29:32 -07:00
setup.cfg add support for installing the web dashboard 2013-12-03 10:41:21 -08:00
setup.py Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
test-requirements.txt Cap Sphinx at <1.2 to avoid distutils problems. 2013-12-10 16:34:44 -08:00
tox.ini Tox allow install of lazr.authentication 2014-01-06 02:50:14 +00:00

elastic-recheck

"Use ElasticSearch to classify OpenStack gate failures"

Idea

Identifying the specific bug that is causing a transient error in the gate is very hard. Just identifying which tempest test failed is not enough because a single bug can potentially cause multiple tempest tests to fail. If we can find a fingerprint for a specific bug using logs, then we can use ElasticSearch to automatically detect any occurrences of the bug.

Using these fingerprints elastic-recheck can:

  • Search ElasticSearch for all occurrences of a bug.
  • Identify bug trends, such as when a bug first appeared, whether it has been fixed, and whether it is getting worse.
  • Classify bug failures in real time and report back to Gerrit if we find a match, so a patch author knows why the test failed.

queries/

All queries are stored in separate YAML files in a queries directory at the top of the elastic-recheck code base. Each file is named ######.yaml (where ###### is the Launchpad bug number). The YAML should have a query keyword whose value is the query text for ElasticSearch.
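As a rough illustration of this layout, the sketch below builds the file name for a bug number and renders a query as YAML text with a single query key. The helper names, the bug number 1234567, and the example message and log path are all hypothetical; this is not how elastic-recheck itself reads or writes these files.

```python
import os
import re

QUERIES_DIR = "queries"  # directory name taken from the text above


def query_file_path(bug_number):
    """Return the relative path of the query file for a Launchpad bug number."""
    bug = str(bug_number)
    if not re.fullmatch(r"\d+", bug):
        raise ValueError("bug number must be all digits: %r" % bug)
    return os.path.join(QUERIES_DIR, bug + ".yaml")


def render_query_file(query):
    """Render a query as YAML text with a single 'query' key.

    A folded block scalar ('>') is used so a long query can span lines.
    """
    lines = [line.strip() for line in query.strip().splitlines()]
    body = "\n".join("  " + line for line in lines if line)
    return "query: >\n" + body + "\n"


# Hypothetical bug number and log message, for illustration only.
example_path = query_file_path(1234567)
example_text = render_query_file(
    'message:"Example error text"\nAND filename:"logs/example.txt"'
)
```

Keeping the bug number in the file name means the file itself only needs to carry the query text and any optional keywords.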

Guidelines for good queries

  • After a bug is resolved and has no more hits in ElasticSearch, we should flag it with a resolved_at keyword. This will let us keep some memory of past bugs, and see if they come back.
  • Queries should get as close as possible to fingerprinting the root cause.
  • Queries should not return any hits for successful jobs; such hits are a sign the query isn't specific enough.
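One way to check the last guideline by hand is to run the candidate query again, restricted to successful jobs, and confirm it returns nothing. The sketch below only builds that probe query string. The field name build_status and the value SUCCESS are assumptions about how the logstash index labels job results, so confirm them by clicking on an entry in logstash before relying on this.

```python
def successful_job_probe(query, status_field="build_status",
                         success_value="SUCCESS"):
    """Wrap a fingerprint query so it only matches hits from successful jobs.

    The field name and value are assumptions, not taken from elastic-recheck.
    Runs of whitespace (including newlines) are collapsed to single spaces.
    """
    collapsed = " ".join(query.split())
    return '(%s) AND %s:"%s"' % (collapsed, status_field, success_value)


# Hypothetical fingerprint, for illustration only.
probe = successful_job_probe(
    'message:"Example error text"\n  AND filename:"logs/example.txt"'
)
```

If the probe returns hits, the fingerprint also matches passing runs and probably needs another term to narrow it down.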

To support adding queries rapidly, it is considered socially acceptable to +A changes that only add one new bug query, and even for core reviewers to self-approve those changes.

Adding Bug Signatures

Most transient bugs seen in the gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.

  1. Given a transient bug that is seen during the gate, go through the logs (logs.openstack.org) and try to find a log that is associated with the failure. The closer to the root cause the better.

  2. Go to logstash.openstack.org and create an ElasticSearch query to find the log message from step 1. To see the possible fields to search on, click on an entry. Lucene query syntax is available at http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description

    Note that wildcard analysis is disabled by default in ElasticSearch, so while a query in logstash might work with wildcards, it will not work in elastic-recheck. See the ElasticSearch documentation for more information on wildcard analysis: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#_wildcards

  3. Add a comment to the bug with the query you identified and a link to the logstash URL for that query search.

  4. Add the query to elastic-recheck/queries/BUGNUMBER.yaml and push the patch up for review. https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
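Because of the wildcard limitation noted in step 2, it can help to scan a candidate query for wildcard characters before pushing the patch. The following is a minimal sketch of such a check. The function name is hypothetical, it is not part of elastic-recheck, and it is deliberately conservative: it also flags wildcard characters inside quoted phrases, where Lucene treats them literally.

```python
import re

# '*' or '?' not preceded by a backslash. This simple check does not
# handle an escaped backslash followed by a wildcard (e.g. '\\*').
_UNESCAPED_WILDCARD = re.compile(r"(?<!\\)[*?]")


def find_wildcards(query):
    """Return (offset, character) pairs for unescaped wildcards in a query."""
    return [(m.start(), m.group())
            for m in _UNESCAPED_WILDCARD.finditer(query)]


# Hypothetical queries, for illustration only.
clean = find_wildcards('message:"Example error text"')
flagged = find_wildcards("message:Exam*")
```

An empty result does not prove the query is a good fingerprint; it only rules out one known difference between logstash and elastic-recheck.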

Future Work

  • Move config files into a separate directory
  • Make unit tests robust
  • Add debug mode flag
  • Expand gating testing
  • Clean up and document code better
  • Add ability to check if any resolved bugs return
  • Move away from polling ElasticSearch to discover whether it is ready
  • Add nightly job to propose a patch to remove bug queries that return no hits (the bug must not have been seen in 2 weeks and must be closed)