Two issues were observed:
* Zuul-web required MQTT connection secrets
* Zuul-web required the keystore password
The first is now required because zuul-web must instantiate a
connection object for each defined connection in order to parse
pipeline definitions.
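Roughly, the parsing path needs something like the following (the
class and function names here are illustrative, not Zuul's actual
API):

import configparser

class DriverConnection:
    # Illustrative stand-in for a Zuul driver connection object; the
    # credentials (e.g. MQTT secrets) live in its options.
    def __init__(self, driver, options):
        self.driver = driver
        self.options = options

def load_connections(config: configparser.ConfigParser) -> dict:
    # Instantiate a connection object for every [connection <name>]
    # section, as zuul-web now must do before parsing pipelines.
    connections = {}
    for section in config.sections():
        if not section.startswith('connection '):
            continue
        name = section.split(' ', 1)[1]
        options = dict(config.items(section))
        connections[name] = DriverConnection(options.pop('driver'), options)
    return connections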
The second is an oversight in documentation. Zuul-web does use
the keystore to answer requests for public keys now (and we generate
public keys from private keys), so it does legitimately need access
to the keystore. This change adds a release note to indicate that
(our original release note for the keystore indicated that only the
scheduler and executor require it), and updates the documentation
for zuul-web to indicate it is required.
Change-Id: I4673c28272576e1e5d6d8123a93fb46abfc85348
The following sequence is possible:
1) New project added to tenant config file
2) Scheduler begins smart reconfiguration
3) Scheduler encounters a problem accessing the branch list for the
   new repo.  This is treated as a ConfigurationError and added
   to the layout error list, but the reconfiguration proceeds.
?) If at some point a branch listing for the new project has succeeded,
there will be an entry for the branch in the branch cache.
4) A zuul-web starts (or attempts a reconfiguration), sees the branch
in the branch cache, attempts to load the files from the config
cache, and fails to load the layout.
The scheduler and web show different behaviors because web is unable
to fetch files via mergers. To bring them closer to the same behavior,
treat missing files from the config cache as layout errors.
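Sketched roughly (these are illustrative names, not Zuul's actual
classes), the new behavior is:

class ConfigurationError(Exception):
    pass

def load_branch_config(config_cache, project, branch, layout_errors):
    # config_cache maps (project, branch) to the cached config files.
    files = config_cache.get((project, branch))
    if files is None:
        # zuul-web cannot ask a merger for the files, so record a
        # layout error (matching the scheduler's behavior for branch
        # listing failures) and continue instead of failing the load.
        layout_errors.append(ConfigurationError(
            "Configuration files for %s@%s are missing from the "
            "config cache" % (project, branch)))
        return {}
    return files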
Change-Id: I76d659f558cc3ed95a6ba7259d11b457ca57976c
Zuul requires that ZK connections be encrypted, and ZK 3.5.1 or newer is
required to make that happen. Make this a bit more clear in the
documentation.
Story: 2009317
Change-Id: Icb335f69446f7db3d3e1e018d031c31c9a2be98b
Apscheduler requires tzlocal/pytz, and they have introduced a warning
which we can avoid by pinning to an earlier version. This is not likely
to be fixed in apscheduler 3.x, but will be in 4.x.
See https://github.com/agronholm/apscheduler/discussions/570
Change-Id: I9c0555ef107d411b8e2fac9dabc7547459e5ffa7
If we stop the scheduler via the command socket the stop method fails
with [1] because it tries to join the command thread which is the
thread that runs the stop function. Check for that to make stopping
via 'zuul-scheduler stop' work.
[1] Traceback
ERROR zuul.Scheduler: Exception while processing command
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.8/site-packages/zuul/scheduler.py", line 339, in runCommand
    self.command_map[command]()
  File "/opt/zuul/lib/python3.8/site-packages/zuul/scheduler.py", line 326, in stop
    self.command_thread.join()
  File "/usr/local/lib/python3.8/threading.py", line 1008, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
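The fix amounts to a guard like the following (simplified sketch,
not the exact scheduler code):

import threading

class Scheduler:
    # Simplified: only the command-thread handling is shown.
    def __init__(self):
        self.command_thread = threading.Thread(target=self.runCommand)

    def runCommand(self):
        # Reads commands from the command socket and dispatches them,
        # including 'stop', which calls self.stop() on this thread.
        pass

    def stop(self):
        # Joining the thread we are currently running on raises
        # "RuntimeError: cannot join current thread", so skip the join
        # when stop() was invoked via the command socket.
        if threading.current_thread() is not self.command_thread:
            self.command_thread.join()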
Change-Id: Icf711d5420eebcd24e0ca2c03ad95b416dad1274
To save space in ZooKeeper compress the data of ZKObjects. This way we
can reduce the amount of data stored in some cases by a factor of 15x
(e.g. for some job znodes).
For data that is not yet compressed, the ZKObject will fall back
to loading the stored JSON data directly.
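A minimal sketch of the approach (zlib and the function names are
assumptions; the real ZKObject code differs in detail):

import json
import zlib

def serialize(data: dict) -> bytes:
    # Compress the JSON representation before writing the znode.
    return zlib.compress(json.dumps(data).encode('utf-8'))

def deserialize(raw: bytes) -> dict:
    # Handle znodes written either compressed or as plain JSON.
    try:
        raw = zlib.decompress(raw)
    except zlib.error:
        # Data written before this change is uncompressed JSON;
        # fall back to parsing it directly.
        pass
    return json.loads(raw.decode('utf-8'))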
Change-Id: Ibb59d3dfc1db0537ff6d28705832f0717d45b632
The following potential problems were observed with FrozenJob secrets:
1) They may be repetitive: since the FrozenJob contains
lists of playbooks and each playbook record has a copy of all the
secrets which should be used for that playbook, if a job has multiple
playbooks the secrets will be repeated for each playbook. Consider a base
job with three playbooks: the base job's secrets will be included
three times.
2) They may be large: secrets in ZK are stored encrypted and suffer the
same size explosion that they do when encrypted into zuul.yaml files.
3) Take #1 and #2 together and we have the possibility of FrozenJob
objects that are larger than 1MB, which is a problem for ZK.
Address all three issues by offloading the secrets to a new ZK node if
they are large (using the existing JobData framework), and by
de-duplicating them and referring to them by index.
There is no backwards compatibility handling here, so the ZK state needs
to be deleted.
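For illustration, the shape of the change is roughly (names, the
size threshold, and the kazoo-style create call are assumptions):

import json

MAX_INLINE_SIZE = 10 * 1024  # assumed threshold, not Zuul's actual value

def deduplicate_secrets(playbooks):
    # Replace each playbook's copies of its secrets with indexes into
    # a single shared list on the job.
    secrets = []
    for playbook in playbooks:
        indexes = []
        for secret in playbook['secrets']:
            if secret not in secrets:
                secrets.append(secret)
            indexes.append(secrets.index(secret))
        playbook['secrets'] = indexes
    return secrets

def store_secrets(zk_client, job_path, secrets):
    # Offload large secret data to its own znode (analogous to the
    # JobData framework) and keep only a reference on the FrozenJob.
    data = json.dumps(secrets).encode('utf-8')
    if len(data) > MAX_INLINE_SIZE:
        zk_client.create(job_path + '/secrets', data, makepath=True)
        return {'offloaded': True}
    return {'offloaded': False, 'secrets': secrets}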
Change-Id: I32133e8dd0e933528381f1187d270142046ff08f
This option allows Zuul to continue to use ssh for Git operations even
when HTTP Password is set for the Gerrit connection. This enables REST
API usage by Zuul even when the Gerrit server requires SSH for Git
operations.
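The intended selection logic, as a rough sketch (the option and key
names below are assumptions, not necessarily the exact ones):

def get_git_url(connection_config):
    # Even with an HTTP password configured (so the REST API can be
    # used), keep Git over SSH when git_over_ssh is set.
    host = connection_config['server']
    use_http = (connection_config.get('password')
                and not connection_config.get('git_over_ssh'))
    if use_http:
        return 'https://%s/a' % host
    user = connection_config['user']
    port = connection_config.get('port', 29418)
    return 'ssh://%s@%s:%s' % (user, host, port)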
Change-Id: Ie16eac048a54b2a698397f47b232d31177c54e07
We should be storing the capabilities.auth object as auth.info
in redux rather than a second copy of the whole info object. This
is used in only one place, to check whether the login button should
be displayed. The error was causing it to never be displayed.
This patch corrects that (and has been tested with multi-tenant,
whitelabel, and sub-path configurations).
Change-Id: I558ecf84f101150465eb5b62bc5787bf9a353793
This adds a simplified version of the RTD version selector to doc
builds. It will list all tags at the time of the build, as well
as 'latest'.
Change-Id: I2870450ffd02f55509fcc1297d050b09deafbfb9
This is an attempt to reorganize docs based on what we've learned so far:
* Audience is important -- help users find the job syntax reference without
getting bogged down in how to run zookeeper.
* Having distinct tutorials, howtos, and reference documentation is helpful.
* Grouping by subject matter is important; users shouldn't have to open tabs
with howto, reference, and tutorial to synthesize all the information on
a subject.
This reorg reduces the use of explicit reference/howto/tutorial/discussion
divisions since in some cases they never got sufficiently fleshed out (eg,
user tutorials), and in others the information was spread too thinly across
them all (eg authentication). However, those distinctions are still useful,
and the new organization reflects that somewhat.
I have made only some changes to content (generally in introductory sections
in order to make things make sense) and added a new "about Zuul" page. We
should still go through the documentation and update it and tweak the organization
further. This is mostly an attempt to get a new framework in place.
The theme is switched from alabaster to RTD. That's because RTD has really
good support for a TOC tree in the side bar with expansion. That makes a big
difference when trying to navigate large documentation like this. The new
framework is intended to have very good left-hand navigation for users.
Change-Id: I5ef88536acf1a1e58a07827e06b07d06588ecaf1
When a scheduler starts, we run some cleanup routines in case
this is the first scheduler start after an unclean shutdown. One
of the routines detects and releases leaked semaphores. There
were two issues with this which caused us to erroneously detect
a semaphore as leaked and delete it. Fixing either alone would
have been sufficient to avoid the issue; fixing both is correct.
1) The pipeline state object was not serializing its layout uuid.
This means that every time a new pipeline state object is created,
it is assumed that its layout is newer than what is in zookeeper,
and a re-enqueue must be performed. All existing queue items are
moved to the old_queues attribute, and a scheduler (possibly a
different one) will re-enqueue them.
2) The semaphore leak detection method did not ask for "old" queue
items. This means that if a re-enqueue is in progress, it
will only see the queue items which have been re-enqueued.
These two issues combined cause us to fairly reliably move all of
the queue items out of the active queue and then look at only the
active queue to determine which semaphores are in use (which is to
say, none) and then release those semaphores.
This patch corrects both issues.
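For the second issue, the collection step now needs to look roughly
like this (illustrative sketch; the method names stand in for the
real model APIs):

def collect_held_semaphores(pipelines):
    # Include items that have been set aside for re-enqueue ("old"
    # queue items) as well as active ones; only looking at the active
    # queue is what allowed semaphores to appear leaked.
    held = set()
    for pipeline in pipelines:
        for item in pipeline.getAllItems(include_old=True):
            for job in item.getJobs():
                held.update(job.semaphores)
    return held

def cleanup_leaked_semaphores(semaphore_handler, pipelines):
    held = collect_held_semaphores(pipelines)
    for semaphore in semaphore_handler.list_held():
        if semaphore not in held:
            semaphore_handler.release(semaphore)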
Change-Id: Ibf5fca03bb3bd33fefdf1982b8245be0e09df567
Reconfiguration events for the same tenant are supposed to be merged
so that if many branches are created, only one reconfiguration happens.
However, an error existed in the merge method which raised an exception
preventing the merge, and the exception handler around the merge method
was too broad and therefore suppressed the error.
This is well tested, but with tests that look more like "unit" tests
than functional tests. They did not contain the branch_cache_ltime
dict which triggered the error.
This patch corrects the error and updates the test to be more realistic.
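Roughly, the merge is intended to work like this (illustrative
sketch; the choice of max for combining ltimes is an assumption):

class TenantReconfigureEvent:
    # Simplified: only the fields relevant to merging are shown.
    def __init__(self, tenant_name, project_branches, branch_cache_ltimes):
        self.tenant_name = tenant_name
        self.project_branches = set(project_branches)
        self.branch_cache_ltimes = dict(branch_cache_ltimes)

    def merge(self, other):
        if other.tenant_name != self.tenant_name:
            raise ValueError("Can only merge events for the same tenant")
        self.project_branches |= other.project_branches
        for connection, ltime in other.branch_cache_ltimes.items():
            # The dict may not contain an entry for every connection;
            # mishandling a missing key is the kind of error that
            # broke the merge.
            self.branch_cache_ltimes[connection] = max(
                self.branch_cache_ltimes.get(connection, ltime), ltime)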
Change-Id: I0b5c7628a0580db55b56eec8c081ebee4d31d989
The gearman RPC methods are being removed, so this needs a change
to work with one of the web-based methods. The simplest and most
forward-looking method is zuul-client, so update it to use that.
Change-Id: If3f6ca4bae2b2beddb3bb71b36fdcba112722186
Older ephemeral ZK state is not compatible with current Zuul.
Add a release note indicating that users must run delete-state
when upgrading.
Change-Id: Ic62243279b070236ceb70fc779b6314ac7b39aa7
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes. We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).
The prometheus_client library doesn't support this directly, so we need
to handle this ourselves. We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.
The prometheus_client will actually return the metrics on any path
given to it. In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI, which is what we specified in our documentation,
and also '/', which users are very likely to request accidentally.
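A rough sketch of the wrapper (the /health/live and /health/ready
paths and the readiness callback are assumptions consistent with the
description above):

from wsgiref.simple_server import make_server
from prometheus_client import CollectorRegistry, make_wsgi_app

def build_monitoring_app(registry, is_ready):
    # Wrap prometheus_client's WSGI app with liveness/readiness
    # endpoints; is_ready() should return True when the component
    # state is running or paused.
    metrics_app = make_wsgi_app(registry)

    def app(environ, start_response):
        path = environ.get('PATH_INFO', '/')
        if path == '/health/live':
            # Live as long as the process is answering requests.
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [b'OK']
        if path == '/health/ready':
            if is_ready():
                start_response('200 OK', [('Content-Type', 'text/plain')])
                return [b'OK']
            start_response('503 Service Unavailable',
                           [('Content-Type', 'text/plain')])
            return [b'NOT READY']
        if path in ('/metrics', '/'):
            return metrics_app(environ, start_response)
        # A typo'd probe path gets a 404, never the metrics page.
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return [b'Not Found']

    return app

# e.g.: make_server('', 9091, build_monitoring_app(
#           CollectorRegistry(), lambda: True)).serve_forever()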
Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
This adjusts the node request relative priority system to become
more coarse as queues become longer.
Currently in the case of a very long queue we might revise the
pending node requests of every item in the queue each time a change
is dequeued. This might be useful for a change near the head of
the queue since it could cause it to receive nodes faster, but
we don't need to update the node request for jobs with the 100th
change in the queue on every update.
To avoid all of the ZK writes generated by these updates, we will
quantize the relative priority logarithmically. So changes near
the head behave as before, but after 10 changes back we decrease
the resolution to buckets of 10, and after 100 changes, buckets of
100.
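The quantization amounts to something like (sketch, not necessarily
the exact formula used):

def quantize_priority(position):
    # 0-9 stay as-is, 10-99 round down to the nearest 10, 100-999 to
    # the nearest 100, and so on.
    if position < 10:
        return position
    magnitude = 10 ** (len(str(position)) - 1)
    return (position // magnitude) * magnitude

# e.g. 7 -> 7, 37 -> 30, 142 -> 100, 250 -> 200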
Change-Id: I63c4b5d08f0d2608bc54094ade7d46d0720ab25e
This is a transitive dep of the openapi sphinx plugin, and it just
released a 2.0 which is incompatible with it; this causes docs builds
to fail. Pin it until that is fixed.
Change-Id: Ic6a2a6520a6d4079f7d81e318316b46b68982cf3
The following is possible:
* Change 1 is updated in gerrit
* Change 2 which Depends-On change 1 is updated
* Change 3 which Depends-On change 2 is updated
* A long time passes
* Change 2 and 3 are updated again
* A short time passes
* Change 1 is pruned from the cache because it hasn't been updated
in 2 hours. Changes 2 and 3 remain since they were recently updated.
* Change 3 is updated
* The driver sees that 3 depends on 2 and looks up 2
* The driver finds 2 in the cache and stops (it does not update 2
and therefore will not re-add 1 to the cache)
* Change 3 is added to a pipeline
* Pipeline processing fails because it cannot resolve change 1
To correct this, once we have decided which changes are too old
and should be removed, and have reduced that set by the set of
changes in the pipeline, find the changes related to the changes we
are keeping and further reduce the set to prune.
In other words, move the related change expansion from outside
the cache prune method to inside, so we expand the network of
changes inside the cache, not just the network of changes in the
pipeline.
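Sketched with a plain dict as the cache (illustrative structure
only):

import time

def prune_cache(cache, pipeline_change_keys, max_age=7200):
    # cache maps change keys to dicts with 'last_updated' timestamps
    # and 'depends_on' lists of change keys.
    now = time.time()
    too_old = {key for key, change in cache.items()
               if now - change['last_updated'] > max_age}
    # Keep everything a pipeline references...
    keep = set(pipeline_change_keys)
    # ...and expand across dependency links inside the cache, so that
    # change 1 in the example above survives because change 2 needs it.
    stack = list(keep)
    while stack:
        change = cache.get(stack.pop())
        if not change:
            continue
        for dep in change.get('depends_on', []):
            if dep not in keep:
                keep.add(dep)
                stack.append(dep)
    for key in too_old - keep:
        del cache[key]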
Change-Id: I9a029bc92cf2eecaff7df3598a6c6993d85978a8