zuul/releasenotes/notes/config-cache-regression-471cd339965b247e.yaml
James E. Blair 598db8a78b Fix error with config cache
The following sequence would trigger this error:

* A project merges a change which renames a zuul.d/* config file
* The scheduler is restarted

When the scheduler starts, it needs to read in every in-repo config
file.  Previously it would issue cat jobs for every project-branch in
the system which would then get the list of files and their contents
from the git repos.  Now it relies on the cache in ZooKeeper to
provide that list and their contents.

That cache is updated whenever a config change merges.  When that
happens, Zuul knows that the cache is invalid for that project-branch,
so it issues a cat job and stores the results in the cache.  This is
how the cache is populated (generally speaking; it is also populated
on startup if any new projects or branches have been added and are not
present in the cache).

Because we use the results of the cat job after the change merges, the
error is not observed immediately.  It manifests only later, when we
rely on the values in ZK, because the contents in ZK are a superset of
all the files Zuul has seen, including files that have since been
renamed or deleted.

We did not simply delete the entire contents of the project-branch
cache when invalidating it because a cat job is run for a specific
tenant, with a specific tenant-project-config (TPC).
This TPC may list extra files to include for only this project in this
tenant.  Therefore, two cat jobs run on the same project-branch but
for different tenants may return a different set of files.  If we
naively removed all the files, we would end up with the smallest
subset in the cache, which would be incorrect.
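
The problem with naive replacement can be illustrated with a minimal
sketch.  The function name, file names and contents below are
hypothetical and are not taken from Zuul's code; the cache is modeled
as a plain dict from path to content.

```python
# Hypothetical illustration; not Zuul's actual cache implementation.
def naive_replace(cache, cat_result):
    """Discard the cached file set and keep only this cat job's files."""
    cache.clear()
    cache.update(cat_result)
    return cache


cache = {}
# Tenant A's TPC lists an extra config file for this project.
naive_replace(cache, {"zuul.yaml": "jobs-a", "extra/jobs.yaml": "jobs-b"})
# Tenant B's TPC does not, so its cat job returns fewer files.
naive_replace(cache, {"zuul.yaml": "jobs-a"})

print(sorted(cache))  # ['zuul.yaml']
```

After the second call the extra file that tenant A still needs is gone
from the cache, even though it still exists in the repo.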

Obviously we do need to delete files if they really don't exist in the
repo.  We can do this safely if we delete files from the cache iff
they do not appear in the set returned by the cat job, but do match
the set of files we expect for this particular TPC.  In that case we
know that if the file really existed, the cat job would have returned
it.  That is what this change implements.
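
The deletion rule can be sketched roughly as follows.  This is an
illustration under stated assumptions, not the code in this change:
`matches_tpc` and `reconcile_cache` are hypothetical names, the cache is
modeled as a dict from path to content, and the TPC is reduced to two
tuples of extra files and extra directories.

```python
# Hypothetical sketch; names and data model are assumptions, not Zuul's API.
DEFAULT_FILES = ("zuul.yaml", ".zuul.yaml")
DEFAULT_DIRS = ("zuul.d", ".zuul.d")


def matches_tpc(path, extra_files=(), extra_dirs=()):
    """Return True if a cat job for this TPC would have requested path."""
    if path in DEFAULT_FILES or path in extra_files:
        return True
    # Compare whole directory components so "zuul.d" does not match "zuul.d2/".
    return any(path.startswith(d + "/") for d in (*DEFAULT_DIRS, *extra_dirs))


def reconcile_cache(cache, cat_result, extra_files=(), extra_dirs=()):
    """Update cache in place from the result of one TPC's cat job."""
    for path in list(cache):
        # Delete only if this TPC expected the file and the cat job did
        # not return it; files outside this TPC's set are left alone.
        if path not in cat_result and matches_tpc(path, extra_files, extra_dirs):
            del cache[path]
    cache.update(cat_result)
    return cache


cache = {"zuul.d/old.yaml": "a", "extra/jobs.yaml": "b"}
# The file was renamed; this tenant's TPC lists no extra files or dirs.
reconcile_cache(cache, {"zuul.d/new.yaml": "a"})
print(sorted(cache))  # ['extra/jobs.yaml', 'zuul.d/new.yaml']
```

The renamed file is dropped because this TPC would have returned it if
it existed, while the extra file, which only another tenant's TPC asks
for, stays in the cache.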

A test is added which shuts down and restarts the scheduler in the
middle of the test.  This is the first such test, so a little
adjustment in the test framework is needed to accommodate this.

Finally, a release note is included since operators may need to
perform a manual step after upgrading in order to reconcile the cache
with reality.

A small change is made to the file filter used when loading dynamic
configs in order to make the directory matching more correct and
consistent between the two cases.

Change-Id: I9a1ee94cf0b55ac04a8f0cc12ac7507cab18d44b
2021-08-18 13:26:21 -07:00


---
upgrade:
  - |
    An error was found in a change related to Zuul's internal
    configuration cache which could cause Zuul to use cached in-repo
    configuration files which no longer exist. If a ``zuul.yaml`` (or
    ``zuul.d/*`` or any related variant) file was deleted or renamed,
    Zuul would honor that change immediately, but would attempt to
    load both the old and new contents from its cache upon the next
    restart.

    This error was introduced in version 4.8.0.

    If upgrading from 4.8.0, run ``zuul-scheduler full-reconfigure``
    in order to correctly update the cache.