Update gerrit changes more atomically

The following problem was observed:

Change A depends-on change B, which is in turn the tip of a
patch series of several changes.

Drivers warm the change cache on events by querying information
about changes related to those events.  But they don't process
depends-on headers, which means most drivers only warm one change,
and while the gerrit driver will follow other types of dependency
links which are unique to it, it stops at depends-on boundaries.

So in the example above, the only change in the cach which was warm
was change A.

The triggering event was enqueued, forwarded, and processed by
two responding pipelines simultaneously on two executors.

Each of them noticed the depends-on link and started querying gerrit
for change B and its dependencies.  One of the schedulers was about
1 second ahead of the other in this process.

In the gerrit driver, there is a two phase process for updating
changes.  First the change itself is updated in the normal way
common to all drivers, and then gerrit-specific dependency links
are updated.  That means the change is added to the change cache
with no dependencies, then mutated to add dependencies later.

The first scheduler added change B to the cache with no dependencies.
The second scheduler saw the update and refreshed its copy of B.
The second scheduler begin updating B, saw that the ltime of its
copy of B was sufficiently new it didn't need to update the cache
and stopped updating.
The second scheduler enqueued changes A and B, but no others in its
pipeline.
The first scheduler finished querying the stack of changes ending at
B, added them to the change cache, and mutated the entry for B in the
cache.
The first scheduler enqueued A, B, and the rest of the stack in its
pipeline.
The second scheduler updated its copy of B to include the new
dependencies.
The second scheduler ran a pipeline processor, noticed that B lacked
dependencies, and dequeued A and B, and reported an error to the user.

The good news is that Zuul noticed the mistake and dequeued the
changes.

To correct this, we will now collect all of the information about a
change and its gerrit-specific dependencies before writing any of
that information to the change cache.  This means that in our example
above, the second scheduler would not have aborted its queries.
Eventually, both schedulers would end up with the same information
before enqueing anything.

This process is still not quite optimal, in that we will have multiple
schedulers racing to update the same changes at the same time, but
they are designed to deal with collisions like that, so it should
at least be correct.

A future area of work might be to investigate whether we can optimize
this case further.

Change-Id: I647c2b54a55789e521fca71c8c3814907df65da6
This commit is contained in:
James E. Blair
2024-02-21 14:18:38 -08:00
parent 8dd4011aa0
commit ced306d5b1
3 changed files with 46 additions and 29 deletions

View File

@@ -115,7 +115,8 @@ class DependentPipelineManager(SharedQueuePipelineManager):
# cycle changes will otherwise be partially enqueued without
# any error handling
self.log.debug(
" Skipping change %s due to dependency cycle"
" Skipping change %s due to dependency cycle",
other_change
)
continue