"Use ElasticSearch to classify OpenStack gate failures"
Identifying the specific bug that is causing a transient error in the gate is difficult. Just identifying which tempest test failed is not enough because a single tempest test can fail due to any number of underlying bugs. If we can find a fingerprint for a specific bug using logs, then we can use ElasticSearch to automatically detect any occurrences of the bug.
Using these fingerprints elastic-recheck can:
All queries are stored in separate yaml files in a queries directory at the top of the elastic-recheck code base. The files are named ######.yaml (where ###### is the Launchpad bug number), and each file should have a query keyword whose value is the query text for ElasticSearch.
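As a rough illustration of the naming convention, the sketch below checks that a file name is a bug number followed by .yaml and extracts the number. The function name bug_number_from_path is hypothetical and not part of elastic-recheck, and the digits-only pattern is an assumption based on the ###### placeholder above.

```python
import re
from pathlib import PurePosixPath

# Assumption: a query file is named <launchpad bug number>.yaml, digits only.
_QUERY_FILE_RE = re.compile(r"^(\d+)\.yaml$")


def bug_number_from_path(path):
    """Return the bug number encoded in a query file name, or None.

    Hypothetical helper for illustration; not an elastic-recheck API.
    """
    match = _QUERY_FILE_RE.match(PurePosixPath(path).name)
    return int(match.group(1)) if match else None
```

A check like this could be run over the queries directory to spot files that do not follow the convention.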
Guidelines for good queries:
A service log query (e.g. tags:"screen-n-net.txt") is typically better than a console one (tags:"console"), as that matches a deep failure rather than a surface symptom.
A tags filter such as tags:"screen-n-cpu.txt" will query in both logs/old/screen-n-cpu.txt and logs/new/screen-n-cpu.txt. The tags:"console" filter is also used to query in console.html as well as tempest and devstack logs.
Avoid wildcards in queries, since they can put an undue burden on the query engine. A common case where wildcards are used and shouldn't be is querying against a specific set of build_name fields, e.g. gate-nova-python26 and gate-nova-python27. Rather than use build_name:gate-nova-python*, list the jobs with an OR. For example:
(build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
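If the list of jobs is long, the OR clause can be generated rather than typed by hand. The following is a minimal sketch; build_name_clause is a hypothetical helper, not something elastic-recheck provides.

```python
def build_name_clause(job_names):
    """Join explicit job names into a parenthesised OR clause.

    Hypothetical helper for illustration; it avoids a wildcard by
    listing every job name explicitly.
    """
    if not job_names:
        raise ValueError("at least one job name is required")
    terms = ['build_name:"%s"' % name for name in job_names]
    return "(" + " OR ".join(terms) + ")"


print(build_name_clause(["gate-nova-python26", "gate-nova-python27"]))
```

For the two jobs above this prints the same clause as the hand-written example.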
When adding queries you can optionally suppress the creation of graphs and notifications by adding suppress-graph: true or suppress-notification: true to the yaml file. These can be used to make sure expected failures don't show up on the unclassified page.
If the only signature available is overly broad and adding additional logging can't reasonably make a good signature, you can also filter the results of a query based on the test_ids that failed for the run being checked. This is done by adding a test_ids keyword to the query file, followed by a list of the test_ids to verify failed. Each test_id should exclude any attrs, that is, the list of attributes appended to the test_id between '[]' (for example 'smoke', 'slow', or any service tags). This is how subunit-trace prints the test ids by default if you're using it. If any of the listed test_ids match as failing for the run being checked with the query, it will return a match. Since filtering leverages subunit2sql, which only receives tempest test results from the gate pipeline, this technique will only work on tempest or grenade jobs in the gate queue. For more information, refer to the infra subunit2sql documentation.
For example, if your query yaml file looked like:
query: >-
  message:"ExceptionA"
test_ids:
  - tempest.api.compute.servers.test_servers.test_update_server_name
  - tempest.api.compute.servers.test_servers_negative.test_server_set_empty_name
this will only match the bug if the logstash query had a hit for the run and either test_update_server_name or test_server_set_empty_name failed during the run.
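The matching rule described above can be sketched as follows. This is an illustration under stated assumptions, not the elastic-recheck implementation: the function names are hypothetical, attrs are assumed to be a single trailing '[...]' group, and the list of failed test ids (which would come from subunit2sql) is simply passed in.

```python
import re

# Assumption: attrs appear as one trailing "[...]" group on the test id.
_ATTRS_RE = re.compile(r"\[[^\]]*\]$")


def strip_attrs(test_id):
    """Drop a trailing attrs group such as "[smoke,slow]" from a test id."""
    return _ATTRS_RE.sub("", test_id)


def query_matches(logstash_hit, query_test_ids, failed_test_ids):
    """Return True if a run should be classified as the bug.

    logstash_hit: whether the logstash query had a hit for the run.
    query_test_ids: test ids listed under test_ids in the query file.
    failed_test_ids: test ids that failed during the run.
    """
    if not logstash_hit:
        return False
    if not query_test_ids:
        # No test_ids filter in the query file: the logstash hit decides.
        return True
    failed = {strip_attrs(test_id) for test_id in failed_test_ids}
    return any(test_id in failed for test_id in query_test_ids)
```

With the example file above, a run that hit the query and failed test_update_server_name (with or without attrs on the reported id) would match, while a run that hit the query but failed only unrelated tests would not.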
To support adding queries rapidly, it's considered socially acceptable to approve changes that only add one new bug query, and even for core reviewers to self-approve those changes.
Most transient bugs seen in gate are not bugs in tempest associated with a specific tempest test failure, but rather some sort of issue further down the stack that can cause many tempest tests to fail.
Tag your commit with a Related-Bug tag in the footer, or add a comment to the bug with the query you identified and a link to the logstash URL for that query search.
Putting the logstash query link in the bug report is also valuable in the case of rare failures that fall outside the window of how far back log results are stored. In such cases the bug might be marked as Incomplete and the e-r query could be removed, only for the failure to re-surface later. If a link to the query is in the bug report someone can easily track when it started showing up again.
Add the query to elastic-recheck/queries/BUGNUMBER.yaml (all queries can be found on git.openstack.org) and push the patch up for review.
You can also help classify Unclassified failed jobs, which is an aggregation of all failed voting gate jobs that don't currently have elastic-recheck fingerprints.
Old queries which are no longer hitting in logstash and are associated with fixed or incomplete bugs are routinely deleted. This is to keep the load on the elastic-search engine as low as possible when checking a job failure. If a bug marked as Incomplete does show up again, the bug should be re-opened with a link to the failure and the e-r query should be restored.
Queries that have "suppress-graph: true" in them generally should not be removed. They track persistent infra issues that are not going away, so we want to keep them around.
Run the elastic-recheck-cleanup command:
$ tox -e venv -- elastic-recheck-cleanup -h
...
usage: elastic-recheck-cleanup [-h] [--bug <bug>] [--dry-run] [-v]

Remove old queries where the affected projects list the bug status as one
of: Fix Committed, Fix Released

optional arguments:
  -h, --help   show this help message and exit
  --bug <bug>  Specific bug number/id to clean. Returns an exit code of
               1 if no query is found for the bug.
  --dry-run    Print out old queries that would be removed but do not
               actually remove them.
  -v           Print verbose information during execution.
Note
You may want to run with the --dry-run option first and sanity check the queries that would be removed before committing the removal.
Commit the changes and push them up for review:
$ git commit -a -m "Remove old queries: `date +%F`"
$ git review -t rm-old-queries
Note
Sometimes bugs are still New/Confirmed/Triaged/In Progress but have not had any hits in over 10 days. Those bugs should be re-assessed to see if they are now actually fixed or incomplete/invalid; if so, mark them as such and then remove the related query.
You can execute an individual query locally and analyze the search results:
$ elastic-recheck-query queries/1331274.yaml
total hits: 133
build_status
  100% FAILURE
build_name
  48% check-grenade-dsvm
  15% check-grenade-dsvm-partial-ncpu
  13% gate-grenade-dsvm
  9% check-grenade-dsvm-icehouse
  9% check-grenade-dsvm-partial-ncpu-icehouse
build_branch
  95% master
  4% stable/icehouse
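To make the breakdown above easier to read, here is a minimal sketch of how per-field percentages like these could be computed from a list of hit values. It is not the tool's actual implementation: facet_percentages is a hypothetical name, and truncating to whole percentages is an assumption suggested by the build_branch figures above summing to 99%.

```python
from collections import Counter


def facet_percentages(values):
    """Return (value, whole-number percentage) pairs, most common first.

    Hypothetical helper for illustration. Percentages are truncated,
    so they may sum to slightly less than 100.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, 100 * count // total)
            for value, count in counts.most_common()]


# Made-up sample data: 19 hits on master, 1 on stable/icehouse.
branches = ["master"] * 19 + ["stable/icehouse"]
for value, percent in facet_percentages(branches):
    print("%d%% %s" % (percent, value))
```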
In addition to using tox, you can also run make to list the current container build and testing commands.