"Use ElasticSearch to classify OpenStack gate failures"
Identifying the specific bug that is causing a transient error in the gate is very hard. Just identifying which tempest test failed is not enough because a single bug can potentially cause multiple tempest tests to fail. If we can find a fingerprint for a specific bug using logs, then we can use ElasticSearch to automatically detect any occurrences of the bug.
Using these fingerprints elastic-recheck can:
All queries are stored in separate YAML files in a queries directory at the top of the elastic-recheck code base. Each file is named ######.yaml (where ###### is the Launchpad bug number) and should contain a query keyword whose value is the query text for ElasticSearch.
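The naming convention above can be checked mechanically. The following is a minimal sketch, not part of elastic-recheck itself: the helper name `bug_number_from_filename` is hypothetical, and it assumes a bug number is simply a run of digits of any length.

```python
import re

# Assumption: a query file name is a run of digits (the Launchpad bug
# number) followed by ".yaml", with nothing else around it.
_QUERY_FILE_RE = re.compile(r"(\d+)\.yaml")


def bug_number_from_filename(filename):
    """Return the bug number encoded in a query file name, or None.

    Hypothetical helper for illustration; not an elastic-recheck API.
    """
    match = _QUERY_FILE_RE.fullmatch(filename)
    return int(match.group(1)) if match else None
```

A check like this could be used to skip files in the queries directory that do not follow the convention, such as a stray README.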
Guidelines for good queries:
Avoid the use of wildcards in queries since they can put an undue burden on the query engine. A common case where wildcards would be useful is querying against a specific set of build_name fields, e.g. gate-nova-python26 and gate-nova-python27. Rather than use build_name:gate-nova-python*, list the jobs with an OR, e.g.:
(build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
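When the list of jobs is long, the OR clause can be generated rather than typed by hand. This is a small sketch under the assumption that job names contain no double quotes; the function name `build_name_clause` is hypothetical and not part of elastic-recheck.

```python
def build_name_clause(job_names):
    """Join exact build_name matches with OR instead of using a wildcard.

    Hypothetical helper; assumes job names contain no double quotes.
    """
    if not job_names:
        raise ValueError("at least one job name is required")
    # One exact-match term per job, e.g. build_name:"gate-nova-python26"
    terms = ['build_name:"{}"'.format(name) for name in job_names]
    return "(" + " OR ".join(terms) + ")"


if __name__ == "__main__":
    print(build_name_clause(["gate-nova-python26", "gate-nova-python27"]))
```

Called with the two job names from the guideline, it produces the same clause as the hand-written example above.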
To support adding queries quickly, it is considered socially acceptable to +A changes that only add one new bug query, and even for core reviewers to self-approve such changes.
Most transient bugs seen in gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.
Given a transient bug that is seen during the gate, go through the logs (logs.openstack.org) and try to find a log that is associated with the failure. The closer to the root cause the better.
Note that queries can only be written against INFO level and higher log messages. This is by design, to avoid overwhelming the search cluster.
Add the query to elastic-recheck/queries/BUGNUMBER.yaml and push the patch up for review. All existing queries can be found at https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
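Creating the query file can be scripted. The sketch below is illustrative only: `write_query_file` is a hypothetical helper, the bug number 1234567 is a placeholder, and the query value is written as a JSON-style double-quoted string, which YAML parsers are assumed to accept for plain ASCII text. Existing files in the queries directory may use a different YAML style.

```python
import json
import tempfile
from pathlib import Path


def write_query_file(repo_root, bug_number, query):
    """Write queries/<bug_number>.yaml under repo_root and return its path.

    Hypothetical helper for illustration; not an elastic-recheck API.
    """
    path = Path(repo_root) / "queries" / "{}.yaml".format(int(bug_number))
    path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps produces a double-quoted, escaped string; for ASCII text
    # this is assumed to be a valid YAML double-quoted scalar as well.
    path.write_text("query: " + json.dumps(query) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Placeholder bug number and query, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as checkout:
        created = write_query_file(
            checkout, 1234567, 'build_name:"gate-nova-python27"'
        )
        print(created.read_text(encoding="utf-8"))
```

The file would still need to be committed and pushed for review in the usual way; the script only covers creating it in the expected location.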