Classify tempest-devstack failures using ElasticSearch
Joe Gordon 8856c6db90 Add query for bug 1257626
Regression that  is causing gate-tempest-dsvm-large-op to timeout and
fail.

Change-Id: I206233c04dd10b64f9cf06cd2c52ed68068acfb2
2013-12-04 10:13:31 +02:00
doc/source Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
elastic_recheck Merge "Update job names" 2013-12-03 10:04:58 +00:00
queries Add query for bug 1257626 2013-12-04 10:13:31 +02:00
.coveragerc Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.gitignore Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.gitreview Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.testr.conf Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
CONTRIBUTING.rst Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
LICENSE Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
MANIFEST.in Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
README.rst wrapping the README.rst file to 80 cols 2013-12-02 11:43:51 -05:00
babel.cfg Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
elasticRecheck.conf.sample move queries.yaml into a queries subdir 2013-12-02 11:43:00 -05:00
recheckwatchbot.yaml Make bot.py behave like a daemon 2013-09-18 17:45:12 -04:00
requirements.txt Make pid file configurable 2013-09-30 10:29:32 -07:00
setup.cfg Add graph script 2013-10-02 14:56:49 -07:00
setup.py Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
test-requirements.txt Add mox fixture to base TestCase 2013-10-01 18:05:33 -04:00
tox.ini Fix E122,E126,E128 items in codebase 2013-12-02 11:43:51 -05:00

README.rst

elastic-recheck

"Classify tempest-devstack failures using ElasticSearch"

Idea

When a tempest job failure is detected by monitoring Gerrit (using gerritlib), a collection of logstash queries is run against the failed job to determine which bug caused the failure.

Eventually this can be tied into the rechecker tool and Launchpad.
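
The flow described above can be sketched roughly as follows. This is an illustration only, not the project's actual code: `classify_failure` and the injected `search` callable are hypothetical names, and the real bot talks to Gerrit and ElasticSearch rather than to an in-memory function.

```python
def classify_failure(failed_job, queries, search):
    """Return the bug numbers whose query matches the failed job.

    failed_job: any identifier for the failed job (hypothetical).
    queries:    dict mapping bug number -> query text.
    search:     callable (query_text, failed_job) -> number of hits.
                It stands in for the real ElasticSearch lookup.
    """
    matching = []
    for bug, query_text in queries.items():
        # A query "recognizes" the failure if it returns at least one hit.
        if search(query_text, failed_job) > 0:
            matching.append(bug)
    return sorted(matching)
```

Passing the search function in as an argument keeps the sketch independent of any particular ElasticSearch client and makes it easy to exercise with a fake.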

queries/

All queries are stored in separate YAML files in a queries directory at the top of the elastic_recheck code base. Each file is named ######.yaml (where ###### is the bug number) and should contain a query keyword whose value is the query text for ElasticSearch.
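
As a minimal sketch of how such a directory could be read, assuming PyYAML is installed and that each file holds a top-level `query` key as described above: the helper names `bug_number_from_path` and `load_queries` are hypothetical and may not match the project's own loader.

```python
import glob
import os
import re

# File names are expected to look like 1257626.yaml (bug number + ".yaml").
_BUG_FILE = re.compile(r"^(\d+)\.yaml$")


def bug_number_from_path(path):
    """Return the bug number encoded in the file name, or None."""
    match = _BUG_FILE.match(os.path.basename(path))
    return int(match.group(1)) if match else None


def load_queries(directory="queries"):
    """Map bug number -> query text for every <bug>.yaml in directory."""
    import yaml  # PyYAML; assumed to be available

    queries = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.yaml"))):
        bug = bug_number_from_path(path)
        if bug is None:
            continue  # not named after a bug number, so skip it
        with open(path) as handle:
            data = yaml.safe_load(handle) or {}
        if "query" not in data:
            raise ValueError("%s has no 'query' key" % path)
        queries[bug] = data["query"]
    return queries
```

Calling `load_queries("queries")` from the top of the repository would then return a dictionary keyed by bug number.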

Guidelines for good queries

  • After a bug is resolved and has no more hits in ElasticSearch, we should flag it with a resolved_at keyword. This will let us keep some memory of past bugs and see if they come back. (Note: this is a forward-looking statement; sorting out resolved_at will come in the future.)
  • Queries should get as close as possible to fingerprinting the root cause.
  • Queries should not return any hits for successful jobs; such hits are a sign the query isn't specific enough.
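
The last guideline lends itself to a simple check. The sketch below assumes each hit is a dictionary carrying the job result under a `build_status` field with the value `"SUCCESS"` for passing jobs; both the field name and the value are assumptions about the indexed data, so adjust them to whatever the logstash records actually contain.

```python
def hits_on_successful_jobs(hits, status_field="build_status"):
    """Return the hits that came from jobs reported as successful.

    hits:         iterable of dicts, one per search result (assumed shape).
    status_field: name of the field holding the job result (assumption).
    """
    return [hit for hit in hits if hit.get(status_field) == "SUCCESS"]


def is_specific_enough(hits):
    """A query passes the guideline if no hit belongs to a successful job."""
    return not hits_on_successful_jobs(hits)
```

A query whose results fail `is_specific_enough` probably matches a log line that also appears in healthy runs and should be tightened.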

To support adding queries rapidly, it is considered socially acceptable to +A (approve) changes that only add one new bug query, and even for core reviewers to self-approve those changes.

Future Work

  • Move config files into a separate directory
  • Make unit tests robust
  • Add debug mode flag
  • Expand gating testing
  • Cleanup and document code better
  • Sort out resolved_at stamping to remove active bugs
  • Move away from polling ElasticSearch to discover whether it is ready
  • Add a nightly job to propose a patch that removes bug queries returning no hits (the bug hasn't been seen in 2 weeks and must be closed)
  • Implement resolved_at in the loader
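
The nightly cleanup item above could start from a selection step like the one below. It is only a sketch of the "not seen in 2 weeks" rule: `stale_bugs` and the `last_hit` mapping are hypothetical, and checking that the bug is closed, as well as proposing the patch, is left out.

```python
from datetime import datetime, timedelta


def stale_bugs(last_hit, now, max_age=timedelta(weeks=2)):
    """Return bug numbers whose query has had no hit within max_age.

    last_hit: dict mapping bug number -> datetime of the most recent hit,
              or None if the query has never matched (assumed shape).
    now:      the reference time, passed in so the function is testable.
    """
    return sorted(
        bug
        for bug, seen in last_hit.items()
        if seen is None or now - seen > max_age
    )
```

For example, with `now = datetime(2013, 12, 4)`, a bug last seen on 2013-11-01 would be reported as stale, while one seen on 2013-12-01 would not.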

Main Dependencies

  • gerritlib
  • pyelasticsearch