zuul/releasenotes/notes/change-cache-prune-fix-eac72e164927c028.yaml
James E. Blair bfe5a4a935 Handle concurrent modification during change cache delete
The following race was observed:

1) Several hours before the error, an event caused a change
   to be queried and added to the cache.
2) The change was enqueued in a pipeline for a while and
   therefore stayed in the relevant set.
3) The change was removed from the pipelines.
4) A cache prune process started shortly before the error and
   calculated the relevant set (the change was not in this set)
   and also the set of changes last modified more than 1 hour
   ago (the change was in this set).  This combination meant
   the entry was subject to pruning.
5) The cache cleanup started slowly deleting changes (this
   took about 3 minutes).
6) An event arrived for the change.  Gerrit was queried and the
   updated change was inserted into the cache.
7) The cache cleanup method got around to deleting the change
   from the cache.
8) Subsequent queue processes couldn't find the change in the
   cache and raised an exception.

Or, in fewer words, the change was updated between the decision
time for the deletion and the deletion itself.
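
The sequence above can be reproduced in miniature.  This is only an
illustrative sketch: the dict-based cache, the key name and the
timestamps are assumptions made for the example and are not Zuul's
actual data structures.

```python
import time

# Hypothetical stand-in for the change cache: key -> last-modified time.
cache = {"change-1": time.time() - 2 * 3600}  # last modified 2 hours ago
relevant = set()  # the change has already left the pipelines (step 3)

# Step 4: the prune takes its decision from a snapshot of the cache.
cutoff = time.time() - 3600
to_delete = [key for key, mtime in cache.items()
             if key not in relevant and mtime < cutoff]

# Step 6: while the slow cleanup is still running, an event refreshes
# the entry.
cache["change-1"] = time.time()

# Step 7: the cleanup acts on its stale decision and removes the
# freshly updated entry.
for key in to_delete:
    cache.pop(key, None)

# Step 8: later lookups no longer find the change.
missing = "change-1" not in cache
```

Nothing in the unconditional removal at step 7 notices that the entry
changed after the decision was made, which is the gap the version
check closes.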

The kazoo delete method takes a version argument which will alert
us if the znode it would delete is of a different version than
specified.  If we remember the version of the cache entry from
when we decide to delete it, we can avoid the race by ensuring that
the deleted znode hasn't been updated since our decision.  This
change implements that.

The 'recursive' parameter is removed since it causes the version
check to always pass.  There are no children under the cache entry,
so it's not necessary.  It was likely only added to simplify the
case where we delete a node which is already deleted (NoNodeError).
To account for that, we handle that exception explicitly.
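
As a rough sketch of the approach described above (not the code in
this change): FakeClient below is a hypothetical in-memory model of
versioned znodes, and the two exception classes are local stand-ins
named after kazoo.exceptions.BadVersionError and NoNodeError.  With
kazoo itself the corresponding call would be
KazooClient.delete(path, version=...).  The prune() helper and the
paths are likewise made up for the example.

```python
class BadVersionError(Exception):
    """Local stand-in for kazoo.exceptions.BadVersionError."""


class NoNodeError(Exception):
    """Local stand-in for kazoo.exceptions.NoNodeError."""


class FakeClient:
    """Hypothetical in-memory model of versioned znodes (not kazoo)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def set(self, path, data):
        # The first write creates version 0; each later write bumps it.
        _, version = self.nodes.get(path, (None, -1))
        self.nodes[path] = (data, version + 1)

    def get_version(self, path):
        return self.nodes[path][1]

    def delete(self, path, version=-1):
        if path not in self.nodes:
            raise NoNodeError(path)
        # version=-1 means "any version", mirroring ZooKeeper semantics.
        if version != -1 and self.nodes[path][1] != version:
            raise BadVersionError(path)
        del self.nodes[path]


def prune(client, candidates):
    """Delete each (path, version) pair unless it changed or vanished."""
    deleted = []
    for path, version in candidates:
        try:
            client.delete(path, version=version)
            deleted.append(path)
        except BadVersionError:
            # Updated since the decision was made: keep the entry.
            pass
        except NoNodeError:
            # Already deleted by someone else: nothing to do.
            pass
    return deleted


client = FakeClient()
client.set("/cache/change-1", b"old")
client.set("/cache/change-2", b"old")

# Decision time: remember the version of every entry to be pruned.
paths = ("/cache/change-1", "/cache/change-2")
candidates = [(p, client.get_version(p)) for p in paths]
candidates.append(("/cache/gone", 0))  # an entry that no longer exists

# An update arrives between the decision and the deletion.
client.set("/cache/change-1", b"new")

deleted = prune(client, candidates)
```

In this model only /cache/change-2 is removed; the updated entry
survives and the missing one is skipped without an error.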

Change-Id: Ica840225fd52585a29452c80d90a4aa5e7763c8a
2021-10-18 15:01:35 -07:00

---
fixes:
  - |
    A bug with the change cache cleanup routine which could have
    caused items to be stuck in pipelines without running jobs has
    been corrected.