Classify tempest-devstack failures using ElasticSearch
Attila Fazekas 4ce12c70a5 Machine remins in HARD_REBOOT status
Adding query for #1224518.

Change-Id: I8edbbc94d0c1255c49e88dcd9307e33652959d09
2013-12-12 17:48:38 +01:00
doc/source Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
elastic_recheck make launchpad integration optional 2013-12-11 19:47:39 -05:00
queries Machine remins in HARD_REBOOT status 2013-12-12 17:48:38 +01:00
web refactor graphite stanzas for readability 2013-12-11 09:19:47 -05:00
.coveragerc Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.gitignore Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.gitreview Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
.testr.conf Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
babel.cfg Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
CONTRIBUTING.rst Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
elasticRecheck.conf.sample move queries.yaml into a queries subdir 2013-12-02 11:43:00 -05:00
LICENSE Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
MANIFEST.in Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
README.rst wrapping the README.rst file to 80 cols 2013-12-02 11:43:51 -05:00
recheckwatchbot.yaml Make bot.py behave like a daemon 2013-09-18 17:45:12 -04:00
requirements.txt Make pid file configurable 2013-09-30 10:29:32 -07:00
setup.cfg add support for installing the web dashboard 2013-12-03 10:41:21 -08:00
setup.py Apply Cookiecutter to the repo. 2013-09-23 15:27:39 -07:00
test-requirements.txt Cap Sphinx at <1.2 to avoid distutils problems. 2013-12-10 16:34:44 -08:00
tox.ini Fix E122,E126,E128 items in codebase 2013-12-02 11:43:51 -05:00

elastic-recheck

"Classify tempest-devstack failures using ElasticSearch"

Idea

When a tempest job failure is detected by monitoring Gerrit (using gerritlib), a collection of logstash queries is run against the failed job to identify which bug caused the failure.

Eventually this can be tied into the rechecker tool and Launchpad.
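
The idea above can be sketched as a loop that runs every known bug query against one failed build. This is only an illustration, not the project's actual code: `classify_failure`, the `search` callable, and the `build_uuid` field name are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical search interface: takes a query string, returns the hit count.
SearchFn = Callable[[str], int]


def classify_failure(build_id: str, queries: Dict[str, str],
                     search: SearchFn) -> List[str]:
    """Return the bug numbers whose query matches logs of the failed build."""
    matches = []
    for bug, query in sorted(queries.items()):
        # Restrict each fingerprint to the single failed build. The field
        # name "build_uuid" is an assumption about how logs are indexed.
        scoped = '(%s) AND build_uuid:"%s"' % (query, build_id)
        if search(scoped) > 0:
            matches.append(bug)
    return matches
```

Passing the search function in as a parameter keeps the sketch independent of any particular ElasticSearch client and makes it easy to exercise with a fake.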

queries/

All queries are stored in separate yaml files in a queries directory at the top of the elastic_recheck code base. Each file is named ######.yaml (where ###### is the bug number) and should contain a query keyword whose value is the query text for ElasticSearch.
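
A minimal sketch of reading such a directory might look like the following. The function names are hypothetical, and the example assumes PyYAML is available (it is not listed among the dependencies in this README). It accepts any number of digits in the filename, since bug numbers are not always six digits long.

```python
import os
import re

# <bug number>.yaml, e.g. 1224518.yaml
_QUERY_FILE = re.compile(r"^(\d+)\.yaml$")


def bug_number(filename):
    """Return the bug number encoded in a query filename, or None."""
    match = _QUERY_FILE.match(os.path.basename(filename))
    return match.group(1) if match else None


def load_queries(directory):
    """Map bug number -> query text for every <bug>.yaml file in directory."""
    import yaml  # PyYAML; assumed to be installed

    queries = {}
    for name in sorted(os.listdir(directory)):
        bug = bug_number(name)
        if bug is None:
            continue  # skip files that do not follow the naming convention
        with open(os.path.join(directory, name)) as handle:
            data = yaml.safe_load(handle) or {}
        if "query" not in data:
            raise ValueError("%s has no 'query' keyword" % name)
        queries[bug] = data["query"]
    return queries
```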

Guidelines for good queries

  • After a bug is resolved and has no more hits in ElasticSearch, we should flag it with a resolved_at keyword. This lets us keep some memory of past bugs and see whether they come back. (Note: this is a forward-looking statement; sorting out resolved_at will come in the future.)
  • Queries should get as close as possible to fingerprinting the root cause.
  • Queries should not return any hits for successful jobs. If they do, the query isn't specific enough.
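
The last guideline can be checked mechanically. The sketch below is one possible way to do it, under the assumption that job results are indexed in a field called `build_status` with the value `SUCCESS`; both the field name and the `search` callable are illustrative, not taken from this repository.

```python
def hits_successful_jobs(query, search):
    """Return True if the query also matches logs from successful jobs.

    `search` is a hypothetical callable that takes a query string and
    returns the number of hits.
    """
    # "build_status" is an assumed field name for the job result.
    scoped = '(%s) AND build_status:"SUCCESS"' % query
    return search(scoped) > 0
```

A query for which this returns True is probably too broad and should be narrowed before it is proposed.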

To support adding queries rapidly, it's considered socially acceptable to +A changes that only add one new bug query, and even for core reviewers to self-approve those changes.

Future Work

  • Move config files into a separate directory
  • Make unit tests robust
  • Add debug mode flag
  • Expand gating testing
  • Clean up and document code better
  • Sort out resolved_at stamping to remove active bugs
  • Move away from polling ElasticSearch to discover whether it's ready or not
  • Add a nightly job that proposes a patch removing bug queries that return no hits (the bug must not have been seen in 2 weeks and must be closed)
  • Implement resolved_at in the loader
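
As a rough illustration of the nightly cleanup item, the selection step could look like the sketch below. It covers only the "not seen in 2 weeks" half of the condition; checking that the bug is closed is left out. The `stale_bugs` name and the shape of `last_seen` are assumptions made for the example.

```python
from datetime import datetime, timedelta


def stale_bugs(last_seen, now, max_age=timedelta(weeks=2)):
    """Return bug numbers whose most recent hit is older than max_age.

    `last_seen` maps bug number -> datetime of the latest hit, or None
    if the query has never matched.
    """
    cutoff = now - max_age
    return sorted(
        bug for bug, seen in last_seen.items()
        if seen is None or seen < cutoff
    )
```

The result would be the list of candidate query files for a removal patch, still subject to human review.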

Main Dependencies

  • gerritlib
  • pyelasticsearch