Fix error with config cache

The following sequence would trigger this error:

* A project merges a change which renames a zuul.d/* config file
* The scheduler is restarted

When the scheduler starts, it needs to read in every in-repo config
file.  Previously it would issue cat jobs for every project-branch in
the system which would then get the list of files and their contents
from the git repos.  Now it relies on the cache in ZooKeeper to
provide that list and their contents.

That cache is updated whenever a config change merges.  When that
happens, Zuul knows that the cache is invalid for that project-branch,
so it issues a cat job and stores the results in the cache.  This is
how the cache is populated (generally speaking; it is also populated
on startup if any new projects or branches have been added and are not
present in the cache).

Because we use the results of the cat job after the change merges, the
error is not observed immediately.  Only later when we rely on the
values in ZK does the error manifest, and that is because the contents
in ZK are a superset of all the files Zuul has seen.

The reason that we did not simply delete the entire contents of the
project-branch cache when we invalidate it is because a cat job is run
for a specific tenant, with a specific tenant-project-config (TPC).
This TPC may list extra files to include for only this project in this
tenant.  Therefore, two cat jobs run on the same project-branch but
for different tenants may return a different set of files.  If we
naively removed all the files, we would end up with the smallest
subset in the cache, which would be incorrect.

Obviously we do need to delete files if they really don't exist in the
repo.  We can do this safely if we delete files from the cache iff
they do not appear in the set returned by the cat job, but do match
the set of files we expect for this particular TPC.  In that case we
know that if the file really existed, the cat job would have returned
it.  That is what this change implements.

A test is added which shuts down and restarts the scheduler in the
middle of the test.  This is the first such test, so a little
adjustment in the test framework is needed to accommodate this.

Finally, a release note is included since operators may need to
perform a manual step after upgrading in order to reconcile the cache
with reality.

A small change is made to the file filter used when loading dynamic
configs in order to make the directory matching more correct and
consistent between the two cases.

Change-Id: I9a1ee94cf0b55ac04a8f0cc12ac7507cab18d44b
This commit is contained in:
James E. Blair 2021-08-18 12:47:11 -07:00
parent 6eb84eb4bd
commit 598db8a78b
9 changed files with 142 additions and 11 deletions

View File

@ -0,0 +1,15 @@
---
upgrade:
- |
An error was found in a change related to Zuul's internal
configuration cache which could cause Zuul to use cached in-repo
configuration files which no longer exist. If a ``zuul.yaml`` (or
``zuul.d/*`` or any related variant) file was deleted or renamed,
Zuul would honor that change immediately, but would attempt to
load both the old and new contents from its cache upon the next
restart.
This error was introduced in version 4.8.0.
If upgrading from 4.8.0, run ``zuul-scheduler full-reconfigure``
in order to correctly update the cache.

View File

@ -4324,8 +4324,8 @@ class SchedulerTestManager:
def __setitem__(self, key: int, value: SchedulerTestApp):
raise Exception("Not implemented, use create method!")
def __delitem__(self, key, value):
raise Exception("Not implemented!")
def __delitem__(self, key):
del self.instances[key]
def __iter__(self):
return iter(self.instances)
@ -4595,11 +4595,7 @@ class ZuulTestCase(BaseTestCase):
self.builds = self.executor_server.running_builds
self.scheds = SchedulerTestManager(self.validate_tenants)
self.scheds.create(
self.log, self.config, self.changes,
self.additional_event_queues, self.upstream_root, self.rpcclient,
self.poller_events, self.git_url_with_auth,
self.fake_sql, self.addCleanup, self.validate_tenants)
self.createScheduler()
self.merge_server = None
@ -4611,6 +4607,13 @@ class ZuulTestCase(BaseTestCase):
self.scheds.execute(
lambda app: app.start(self.validate_tenants))
def createScheduler(self):
self.scheds.create(
self.log, self.config, self.changes,
self.additional_event_queues, self.upstream_root, self.rpcclient,
self.poller_events, self.git_url_with_auth,
self.fake_sql, self.addCleanup, self.validate_tenants)
def __event_queues(self, matcher) -> List[Queue]:
# TODO (felix): Can be removed when the nodes provisioned events are
# switched to ZooKeeper.

View File

@ -0,0 +1,2 @@
- hosts: all
tasks: []

View File

@ -0,0 +1,22 @@
- pipeline:
name: check
manager: independent
trigger:
gerrit:
- event: patchset-created
success:
gerrit:
Verified: 1
failure:
gerrit:
Verified: -1
- job:
name: base
parent: null
run: playbooks/job.yaml
- project:
name: common-config
check:
jobs: []

View File

@ -0,0 +1 @@
test

View File

@ -0,0 +1,8 @@
- job:
name: project-test1
- project:
name: org/project
check:
jobs:
- project-test1

View File

@ -0,0 +1,8 @@
- tenant:
name: tenant-one
source:
gerrit:
config-projects:
- common-config
untrusted-projects:
- org/project

View File

@ -2441,6 +2441,60 @@ class TestInRepoConfig(ZuulTestCase):
A.messages[0], "A should have debug info")
class TestInRepoConfigDir(ZuulTestCase):
# Like TestInRepoConfig, but the fixture test files are in zuul.d
tenant_config_file = 'config/in-repo-dir/main.yaml'
def test_file_move(self):
# Tests that a zuul config file can be renamed
in_repo_conf = textwrap.dedent(
"""
- job:
name: project-test2
- project:
name: org/project
check:
jobs:
- project-test2
""")
file_dict = {'zuul.d/project.yaml': None,
'zuul.d/newfile.yaml': in_repo_conf}
A = self.fake_gerrit.addFakeChange('org/project', 'master', 'A',
files=file_dict)
A.setMerged()
self.fake_gerrit.addEvent(A.getChangeMergedEvent())
self.waitUntilSettled()
B = self.fake_gerrit.addFakeChange('org/project', 'master', 'B')
self.fake_gerrit.addEvent(B.getPatchsetCreatedEvent(1))
self.waitUntilSettled()
self.assertHistory([
dict(name='project-test2', result='SUCCESS', changes='2,1'),
], ordered=True)
self.scheds[0].sched.stop()
self.scheds[0].sched.join()
del self.scheds[0]
self.log.debug("Restarting scheduler")
self.createScheduler()
self.scheds[0].start(self.validate_tenants)
self.waitUntilSettled()
# The fake gerrit was lost with the scheduler shutdown;
# restore the state we care about:
self.fake_gerrit.change_number = 2
C = self.fake_gerrit.addFakeChange('org/project', 'master', 'C')
self.fake_gerrit.addEvent(C.getPatchsetCreatedEvent(1))
self.waitUntilSettled()
self.assertHistory([
dict(name='project-test2', result='SUCCESS', changes='2,1'),
dict(name='project-test2', result='SUCCESS', changes='3,1'),
], ordered=True)
class TestGlobalRepoState(AnsibleZuulTestCase):
config_file = 'zuul-connections-gerrit-and-github.conf'
tenant_config_file = 'config/global-repo-state/main.yaml'

View File

@ -1787,8 +1787,10 @@ class TenantParser(object):
pb_ltime = min_ltimes[project.canonical_name][branch]
if files_cache.isValidFor(tpc, pb_ltime):
self.log.debug(
"Using files from cache for project %s @%s",
project.canonical_name, branch)
"Using files from cache for project "
"%s @%s: %s",
project.canonical_name, branch,
list(files_cache.keys()))
branch_cache = abide.getUnparsedBranchCache(
project.canonical_name, branch)
if branch_cache.isValidFor(tpc, files_cache.ltime):
@ -1842,6 +1844,22 @@ class TenantParser(object):
# Cache file in Zookeeper
if content is not None:
files_cache[fn] = content
# If there are any files in the cache which match our
# set of requested files but aren't in the set
# returned by the cat job, then they must have been
# deleted. Remove them from the cache so that the
# file list in the cache is correct.
ok_files = ('zuul.yaml', '.zuul.yaml') + tpc.extra_config_files
ok_dirs = ('zuul.d', '.zuul.d') + tpc.extra_config_dirs
for fn in files_cache.keys():
if ((fn in ok_files) or
any([fn.startswith(d + '/') for d in ok_dirs])):
if fn not in job.files:
self.log.debug(
"Removing file from cache for project "
"%s @%s: %s",
project.canonical_name, branch, fn)
del files_cache[fn]
files_cache.setValidFor(job.extra_config_files,
job.extra_config_dirs,
job.ltime)
@ -2405,10 +2423,10 @@ class ConfigLoader(object):
if fn.startswith(".zuul.d/"):
fns2.append(fn)
for ef in tpc.extra_config_files:
if fn.startswith(ef):
if fn == ef:
fns3.append(fn)
for ed in tpc.extra_config_dirs:
if fn.startswith(ed):
if fn.startswith(ed + '/'):
fns4.append(fn)
fns = (["zuul.yaml"] + sorted(fns1) + [".zuul.yaml"] +
sorted(fns2) + fns3 + sorted(fns4))