bfe5a4a935
The following race was observed:

1) Several hours before the error, an event caused a change to be
   queried and added to the cache.
2) The change was enqueued in a pipeline for a while and therefore
   stayed in the relevant set.
3) The change was removed from the pipelines.
4) A cache prune process started shortly before the error and
   calculated the relevant set (the change was not in this set) and
   also the changes that were last modified > 1 hour ago (the change
   was in this set). This combination means the entry is subject to
   pruning.
5) The cache cleanup starts slowly deleting changes (this takes about
   3 minutes).
6) An event arrives for the change. Gerrit is queried and the updated
   change is inserted into the cache.
7) The cache cleanup method gets around to deleting the change from
   the cache.
8) Subsequent queue processes can't find the change in the cache and
   raise an exception.

Or, in fewer words, the change was updated between the decision time
for the deletion and the deletion itself.

The kazoo delete method takes a version argument which will alert us
if the znode it would delete is of a different version than specified.
If we remember the version of the cache entry from when we decide to
delete it, we can avoid the race by ensuring that the deleted znode
hasn't been updated since our decision.

This change implements that.

The 'recursive' parameter is removed since it causes the version check
to always pass. There are no children under the cache entry, so it's
not necessary. It was likely only added to simplify the case where we
delete a node which is already deleted (NoNodeError). To account for
that, we handle that exception explicitly.

Change-Id: Ica840225fd52585a29452c80d90a4aa5e7763c8a
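The version-checked deletion described in the message above can be sketched roughly as follows. This is not the code from this change: `FakeZK` is a hypothetical in-memory stand-in for a ZooKeeper client, `prune_entry` is an illustrative helper name, and the two exception classes are local stand-ins for `kazoo.exceptions.BadVersionError` and `kazoo.exceptions.NoNodeError`, so that the sketch runs without a ZooKeeper server.

```python
"""Sketch of a version-checked cache prune (in-memory stand-in, not kazoo)."""


class NoNodeError(Exception):
    """Local stand-in for kazoo.exceptions.NoNodeError."""


class BadVersionError(Exception):
    """Local stand-in for kazoo.exceptions.BadVersionError."""


class FakeZK:
    """Hypothetical in-memory store that mimics znode versioning."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def set(self, path, data):
        # A new node starts at version 0; each update bumps the version.
        _, version = self._nodes.get(path, (None, -1))
        self._nodes[path] = (data, version + 1)

    def get_version(self, path):
        if path not in self._nodes:
            raise NoNodeError(path)
        return self._nodes[path][1]

    def delete(self, path, version=-1):
        # As with kazoo's delete(), version=-1 matches any version.
        if path not in self._nodes:
            raise NoNodeError(path)
        if version != -1 and self._nodes[path][1] != version:
            raise BadVersionError(path)
        del self._nodes[path]


def prune_entry(zk, path, decided_version):
    """Delete path only if it is unchanged since the prune decision.

    Returns True if the entry is gone afterwards, False if it was kept
    because it was updated after the decision was made.
    """
    try:
        zk.delete(path, version=decided_version)
    except BadVersionError:
        return False  # updated between decision and deletion: keep it
    except NoNodeError:
        pass  # already deleted by someone else: nothing to do
    return True


if __name__ == "__main__":
    zk = FakeZK()
    zk.set("/cache/change-1", b"old")
    decided = zk.get_version("/cache/change-1")  # step 4: decision time
    zk.set("/cache/change-1", b"new")            # step 6: event updates entry
    print(prune_entry(zk, "/cache/change-1", decided))  # step 7: entry is kept
```

With a real kazoo client the same shape would apply: remember the `version` from the znode stat at decision time, pass it to `delete()`, and treat a version mismatch as "keep the entry".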
7 lines
169 B
YAML
---
fixes:
  - |
    A bug with the change cache cleanup routine which could have
    caused items to be stuck in pipelines without running jobs has
    been corrected.