Classify tempest-devstack failures using ElasticSearch
Matt Riedemann 6b60f3b09b Add e-r query for bug 1258682 7 years ago
doc/source Document adding bug signatures to e-r. 7 years ago
elastic_recheck make launchpad integration optional 7 years ago
queries Add e-r query for bug 1258682 7 years ago
web refactor graphite stanzas for readability 7 years ago
.coveragerc Apply Cookiecutter to the repo. 7 years ago
.gitignore Apply Cookiecutter to the repo. 7 years ago
.gitreview Apply Cookiecutter to the repo. 7 years ago
.testr.conf Apply Cookiecutter to the repo. 7 years ago
CONTRIBUTING.rst Apply Cookiecutter to the repo. 7 years ago
LICENSE Apply Cookiecutter to the repo. 7 years ago
MANIFEST.in Apply Cookiecutter to the repo. 7 years ago
README.rst Document adding bug signatures to e-r. 7 years ago
babel.cfg Apply Cookiecutter to the repo. 7 years ago
elasticRecheck.conf.sample move queries.yaml into a queries subdir 7 years ago
recheckwatchbot.yaml Make bot.py behave like a daemon 7 years ago
requirements.txt Make pid file configurable 7 years ago
setup.cfg add support for installing the web dashboard 7 years ago
setup.py Apply Cookiecutter to the repo. 7 years ago
test-requirements.txt Cap Sphinx at <1.2 to avoid distutils problems. 7 years ago
tox.ini Fix E122,E126,E128 items in codebase 7 years ago

README.rst

elastic-recheck

"Classify tempest-devstack failures using ElasticSearch"

Idea

Gerrit is monitored (using gerritlib) for tempest job failures. When a failure is detected, a collection of logstash queries is run against the failed job to determine which bug caused it.

Eventually this can be tied into the rechecker tool and Launchpad.
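The classification step described in this section can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `search` callable stands in for whatever Elasticsearch client is used, and the `build_uuid` field name and the bug numbers are assumptions made for the example.

```python
def classify_failure(build_uuid, queries, search):
    """Return the bug numbers whose query matches the failed build.

    queries: dict mapping bug number -> query string.
    search:  hypothetical callable taking a query string and returning
             the number of hits (a stand-in for a real search client).
    """
    matched = []
    for bug_number, query in sorted(queries.items()):
        # Restrict each bug query to the one failed build.
        # "build_uuid" is an assumed field name, not taken from the source.
        scoped_query = '(%s) AND build_uuid:"%s"' % (query, build_uuid)
        if search(scoped_query) > 0:
            matched.append(bug_number)
    return matched


def fake_search(query):
    # Stand-in for a real search: only the "Timed out" signature has hits.
    return 3 if "Timed out" in query else 0


# Hypothetical bug numbers and queries, for illustration only.
queries = {
    1111111: 'message:"Timed out waiting for thing"',
    2222222: 'message:"Connection refused"',
}
print(classify_failure("abc123", queries, fake_search))  # prints [1111111]
```

Passing the search function in as an argument keeps the matching logic testable without a running Elasticsearch cluster.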

queries/

All queries are stored in separate YAML files in a queries directory at the top of the elastic_recheck code base. The files are named ######.yaml (where ###### is the bug number). Each file should have a query keyword whose value is the query text for Elasticsearch.
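As an illustration of the naming convention above, the following sketch finds query files by name and renders the text of a new one. The helper names are hypothetical, and writing the query as a folded block scalar (`query: >`) is an assumption about layout; any valid YAML with a `query` key fits the description.

```python
import os
import re

# File names are "<bug number>.yaml", as described above.
QUERY_FILE_RE = re.compile(r"^(\d+)\.yaml$")


def find_query_files(queries_dir):
    """Return {bug_number: path} for every file named <digits>.yaml."""
    found = {}
    for name in sorted(os.listdir(queries_dir)):
        match = QUERY_FILE_RE.match(name)
        if match:
            found[int(match.group(1))] = os.path.join(queries_dir, name)
    return found


def render_query_file(query_text):
    """Render a YAML document with a single `query` key.

    The query is emitted as a folded block scalar, one indented line per
    non-empty input line (an assumed layout, chosen for readability).
    """
    lines = [line.strip() for line in query_text.splitlines() if line.strip()]
    return "query: >\n" + "".join("  %s\n" % line for line in lines)


# Hypothetical query text; the field names are illustrative only.
print(render_query_file('message:"example failure text" AND\nfilename:"example.txt"'))
```

Files whose names are not purely numeric, such as a README, are skipped rather than treated as errors.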

Guidelines for good queries

  • After a bug is resolved and has no more hits in Elasticsearch, we should flag it with a resolved_at keyword. This will let us keep some memory of past bugs and see if they come back. (Note: this is a forward-looking statement; sorting out resolved_at will come in the future.)
  • Queries should get as close as possible to fingerprinting the root cause.
  • Queries should not return any hits for successful jobs. If they do, the query isn't specific enough.
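The last guideline can be checked mechanically. The sketch below assumes each hit is a dict carrying a `build_status` field whose value is `"SUCCESS"` for passing jobs; both the field name and the hit shape are assumptions for illustration, not something this README specifies.

```python
def hits_from_successful_jobs(hits):
    """Return the hits that came from jobs that passed.

    Each hit is assumed to be a dict with a "build_status" key
    (hypothetical shape, for illustration only).
    """
    return [hit for hit in hits if hit.get("build_status") == "SUCCESS"]


def is_specific_enough(hits):
    """A query passes the guideline if no hit comes from a successful job."""
    return not hits_from_successful_jobs(hits)


sample_hits = [
    {"build_status": "FAILURE", "message": "example failure text"},
    {"build_status": "SUCCESS", "message": "example failure text"},
]
print(is_specific_enough(sample_hits))  # prints False
```

A query that fails this check usually needs an extra clause, such as a narrower message or file name, before it is added.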

To allow queries to be added quickly, it is considered socially acceptable to +A changes that only add one new bug query, and even for core reviewers to self-approve such changes.

Adding Bug Signatures

Most transient bugs seen in the gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.

  1. Given a transient bug that is seen during the gate, go through the logs (logs.openstack.org) and try to find a log that is associated with the failure. The closer to the root cause the better.
  2. Go to logstash.openstack.org and create an Elasticsearch query to find the log message from step 1. To see which fields can be searched, click on an entry. Lucene query syntax is documented at http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
  3. Add a comment to the bug with the query you identified and a link to the logstash URL for that query search.
  4. Add the query to elastic-recheck/queries/BUGNUMBER.yaml and push the patch up for review. Existing queries can be browsed at https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
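When turning a log message from step 1 into a query for step 2, quoting is the usual stumbling block. The sketch below builds a Lucene phrase query, escaping backslashes and double quotes inside the quoted phrase. The helper names, field names, and sample values are hypothetical, and the result should still be tried out on logstash.openstack.org, since the way a field is indexed affects what matches.

```python
def phrase_clause(field, text):
    """Return field:"text" with backslashes and double quotes escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field, escaped)


def build_query(**fields):
    """AND together one phrase clause per keyword argument."""
    return " AND ".join(
        phrase_clause(field, text) for field, text in sorted(fields.items())
    )


# Hypothetical log message and file name, for illustration only.
print(build_query(
    message='Timed out waiting for "thing"',
    filename="logs/example.txt",
))
```

Clauses are sorted by field name so the same inputs always produce the same query string, which makes the query easier to compare against the one recorded in the bug comment.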

Future Work

  • Move config files into a separate directory
  • Make unit tests robust
  • Add debug mode flag
  • Expand gating testing
  • Cleanup and document code better
  • Sort out resolved_at stamping to remove active bugs
  • Move away from polling ElasticSearch to discover whether it is ready
  • Add nightly job to propose a patch to remove bug queries that return no hits -- Bug hasn't been seen in 2 weeks and must be closed
  • Implement resolved_at in loader
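The nightly cleanup item above could start from a selection step like the one sketched here. It covers only the "not seen in 2 weeks" condition; checking that the bug is closed and proposing the patch are left out. The function name and the `last_seen` mapping are hypothetical.

```python
from datetime import datetime, timedelta

# Threshold taken from the list item above: not seen in 2 weeks.
STALE_AFTER = timedelta(weeks=2)


def stale_bugs(last_seen, now):
    """Return sorted bug numbers whose query has had no recent hits.

    last_seen: dict mapping bug number -> datetime of the latest hit,
               or None if the query has never matched (hypothetical input).
    """
    return sorted(
        bug for bug, seen in last_seen.items()
        if seen is None or now - seen >= STALE_AFTER
    )


now = datetime(2013, 12, 20)
# Hypothetical bug numbers and timestamps, for illustration only.
last_seen = {
    1111111: datetime(2013, 12, 19),  # seen yesterday: keep
    2222222: datetime(2013, 12, 1),   # 19 days ago: stale
    3333333: None,                    # never matched: stale
}
print(stale_bugs(last_seen, now))  # prints [2222222, 3333333]
```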

Main Dependencies

  • gerritlib
  • pyelasticsearch