We should be storing the capabilities.auth object as auth.info
in redux rather than a second copy of the whole info object. This
is used in only one place, to check whether the login button should
be displayed. The error was causing it to never be displayed.
This patch corrects that (and has been tested with multi-tenant,
whitelabel, and sub-path configurations).
Change-Id: I558ecf84f101150465eb5b62bc5787bf9a353793
This is an attempt to reorganize docs based on what we've learned so far:
* Audience is important -- help users find the job syntax reference without
getting bogged down in how to run zookeeper.
* Having distinct tutorials, howtos, and reference documentation is helpful.
* Grouping by subject matter is important; users shouldn't have to open tabs
with howto, reference, and tutorial to synthesize all the information on
a subject.
This reorg reduces the use of explicit reference/howto/tutorial/discussion
divisions since in some cases they never got sufficiently fleshed out (eg,
user tutorials), and in others the information was spread too thinly across
them all (eg, authentication). However, those distinctions are still useful,
and the new organization reflects that somewhat.
I have made only some changes to content (generally in introductory sections
in order to make things make sense) and added a new "about Zuul" page. We
should still go through the documentation and update it and tweak the organization
further. This is mostly an attempt to get a new framework in place.
The theme is switched from alabaster to RTD. That's because RTD has really
good support for a TOC tree in the side bar with expansion. That makes a big
difference when trying to navigate large documentation like this. The new
framework is intended to have very good left-hand navigation for users.
Change-Id: I5ef88536acf1a1e58a07827e06b07d06588ecaf1
When a scheduler starts, we run some cleanup routines in case
this is the first scheduler start after an unclean shutdown. One
of the routines detects and releases leaked semaphores. There
were two issues with this which caused us to erroneously detect
a semaphore as leaked and delete it. Fixing either alone would
have been sufficient to avoid the issue; fixing both is correct.
1) The pipeline state object was not serializing its layout uuid.
This means that every time a new pipeline state object is created,
it is assumed that its layout is newer than what is in zookeeper,
and a re-enqueue must be performed. All existing queue items are
moved to the old_queues attribute, and a scheduler (possibly a
different one) will re-enqueue them.
2) The semaphore leak detection method did not ask for "old" queue
items. This means that if a re-enqueue is in progress, it
will only see the queue items which have been re-enqueued.
Combined, these two issues fairly reliably caused us to move all of
the queue items out of the active queue, look at only the active
queue to determine which semaphores are in use (which is to say,
none), and then release those semaphores.
This patch corrects both issues.
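As a rough sketch of the shape of the fix for the second issue (the
names getAllItems, include_old, and the semaphore handler methods are
illustrative stand-ins, not Zuul's exact API):

    def cleanup_leaked_semaphores(pipelines, semaphore_handler):
        held = set()
        for pipeline in pipelines:
            # Include items parked in old_queues during a re-enqueue so
            # a half-finished re-enqueue cannot hide semaphore holders.
            for item in pipeline.getAllItems(include_old=True):
                for job in item.getJobs():
                    held.update(job.semaphores)
        for semaphore in semaphore_handler.heldSemaphores():
            if semaphore not in held:
                semaphore_handler.release(semaphore)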
Change-Id: Ibf5fca03bb3bd33fefdf1982b8245be0e09df567
Reconfiguration events for the same tenant are supposed to be merged
so that if many branches are created, only one reconfiguration happens.
However, an error existed in the merge method which raised an exception
preventing the merge, and the exception handler around the merge method
was too broad and therefore suppressed the error.
This is well tested, but with tests that look more like "unit" tests
than functional tests. They did not contain the branch_cache_ltime
dict which triggered the error.
This patch corrects the error and updates the test to be more realistic.
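For illustration only (the real merge method differs), merging two
reconfiguration events can keep the newest branch cache ltime per
connection; this is the kind of dict handling the broken merge tripped
over:

    class TenantReconfigureEvent:
        def __init__(self):
            self.branch_cache_ltime = {}  # connection name -> ltime

        def merge(self, other):
            # Keep the newest ltime for each connection so one merged
            # event covers every branch event it absorbed.
            for name, ltime in other.branch_cache_ltime.items():
                self.branch_cache_ltime[name] = max(
                    self.branch_cache_ltime.get(name, -1), ltime)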
Change-Id: I0b5c7628a0580db55b56eec8c081ebee4d31d989
The gearman RPC methods are being removed, so this needs a change
to work with one of the web-based methods. The simplest and most
forward-looking method is zuul-client, so update it to use that.
Change-Id: If3f6ca4bae2b2beddb3bb71b36fdcba112722186
Older ephemeral ZK state is not compatible with current Zuul.
Add a release note indicating that users must run delete-state
when upgrading.
Change-Id: Ic62243279b070236ceb70fc779b6314ac7b39aa7
To facilitate automation of rolling restarts, configure the prometheus
server to answer readiness and liveness probes. We are 'live' if the
process is running, and we are 'ready' if our component state is
either running or paused (not initializing or stopped).
The prometheus_client library doesn't support this directly, so we need
to handle this ourselves. We could create yet another HTTP server that
each component would need to start, or we could take advantage of the
fact that the prometheus_client is a standard WSGI service and just
wrap it in our own WSGI service that adds the extra endpoints needed.
Since that is far simpler and less resource intensive, that is what
this change does.
The prometheus_client will actually return the metrics on any path
given to it. In order to reduce the chances of an operator configuring
a liveness probe with a typo (eg '/healthy/ready') and getting the
metrics page served with a 200 response, we restrict the metrics to
only the '/metrics' URI which is what we specified in our documentation,
and also '/', which users are likely to use by accident.
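As a rough sketch of the approach (the probe paths and the get_state
callback here are illustrative assumptions, not necessarily what this
change uses):

    from prometheus_client import make_wsgi_app

    def make_monitoring_app(get_state):
        metrics_app = make_wsgi_app()

        def app(environ, start_response):
            path = environ.get('PATH_INFO', '/')
            if path == '/health/live':
                # Live as long as the process is serving requests.
                start_response('200 OK', [('Content-Type', 'text/plain')])
                return [b'OK']
            if path == '/health/ready':
                # Ready only when the component is running or paused.
                if get_state() in ('running', 'paused'):
                    start_response('200 OK',
                                   [('Content-Type', 'text/plain')])
                    return [b'OK']
                start_response('503 Service Unavailable',
                               [('Content-Type', 'text/plain')])
                return [b'NOT READY']
            if path in ('/', '/metrics'):
                # Serve metrics only on the documented paths.
                return metrics_app(environ, start_response)
            start_response('404 Not Found',
                           [('Content-Type', 'text/plain')])
            return [b'NOT FOUND']

        return app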
Change-Id: I154ca4896b69fd52eda655209480a75c8d7dbac3
This adjusts the node request relative priority system to become
more coarse as queues become longer.
Currently in the case of a very long queue we might revise the
pending node requests of every item in the queue each time a change
is dequeued. This might be useful for a change near the head of
the queue since it could cause it to receive nodes faster, but
we don't need to update the node requests for the jobs of the 100th
change in the queue on every update.
To avoid all of the ZK writes generated by these updates, we will
quantize the relative priority logarithmically. Changes near the head
behave as before, but beyond 10 changes back we decrease the
resolution to buckets of 10, and beyond 100 changes, buckets of 100.
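A minimal sketch of the quantization (the function name is
illustrative):

    import math

    def quantized_priority(position):
        # Positions 0-9 keep full resolution; 10-99 round down to
        # multiples of 10; 100-999 to multiples of 100; and so on.
        if position < 10:
            return position
        magnitude = 10 ** int(math.log10(position))
        return (position // magnitude) * magnitude

With this, positions 10 through 19 all share a priority of 10, so a
dequeue only changes the submitted priority for items whose bucket
changes.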
Change-Id: I63c4b5d08f0d2608bc54094ade7d46d0720ab25e
This is a transitive dep of the openapi sphinx plugin, and its newly
released 2.0 is incompatible with that plugin; this causes docs builds
to fail. Pin it until that is fixed.
Change-Id: Ic6a2a6520a6d4079f7d81e318316b46b68982cf3
The following is possible:
* Change 1 is updated in gerrit
* Change 2 which Depends-On change 1 is updated
* Change 3 which Depends-On change 2 is updated
* A long time passes
* Changes 2 and 3 are updated again
* A short time passes
* Change 1 is pruned from the cache because it hasn't been updated
in 2 hours. Changes 2 and 3 remain since they were recently updated.
* Change 3 is updated
* The driver sees that 3 depends on 2 and looks up 2
* The driver finds 2 in the cache and stops (it does not update 2
and therefore will not re-add 1 to the cache)
* Change 3 is added to a pipeline
* Pipeline processing fails because it cannot resolve change 1
To correct this, once we have decided which changes are too old and
should be removed, and have reduced that set by the set of changes
in the pipeline, find the changes related to the remaining changes
and further reduce the set to prune.
In other words, move the related change expansion from outside
the cache prune method to inside, so we expand the network of
changes inside the cache, not just the network of changes in the
pipeline.
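A rough sketch of the revised prune flow (the helper names here are
illustrative, not the real change cache API):

    def prune(cache, in_pipeline_keys, get_related, max_age, now):
        too_old = {key for key, change in cache.items()
                   if now - change.last_modified > max_age}
        # Start from the changes still in a pipeline...
        keep = set(in_pipeline_keys)
        # ...and expand the dependency network inside the cache itself,
        # so a recently updated change cannot strand an old dependency.
        pending = list(keep)
        while pending:
            key = pending.pop()
            for dep_key in get_related(key):
                if dep_key in cache and dep_key not in keep:
                    keep.add(dep_key)
                    pending.append(dep_key)
        for key in too_old - keep:
            del cache[key]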
Change-Id: I9a029bc92cf2eecaff7df3598a6c6993d85978a8
This adds a Zuul quick-start tutorial add-on that sets up a keycloak
server. This can be used by new users to demonstrate the admin API
capability, or by developers for testing.
Change-Id: I7ce73ce499dd840ad43fd8d0c6544177d02a7187
Co-Authored-By: Matthieu Huin <mhuin@redhat.com>
Allow filtering searches by primary index; ie, return only
builds or buildsets whose primary index key is greater than idx_min
or lower than idx_max. This is expected to be faster than using the
offset argument where possible, since "offset" requires the database
to sift through all results until the offset is reached.
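A hedged SQLAlchemy-style sketch of how the filters can be applied
(the table and column names are illustrative, not the real schema):

    import sqlalchemy as sa

    builds = sa.table('builds', sa.column('id'), sa.column('result'))

    def apply_index_filters(query, idx_min=None, idx_max=None):
        if idx_min is not None:
            query = query.where(builds.c.id > idx_min)
        if idx_max is not None:
            query = query.where(builds.c.id < idx_max)
        return query

    # Example: fetch builds with a primary key above 1000.
    query = apply_index_filters(sa.select(builds), idx_min=1000)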
Change-Id: I420d71d7c62dad6d118310525e97b4a546f05f99
This better matches our use of icons elsewhere (example: build list
table column header). It also separates the icon from the queue
length indicator which reads sort of like an icon as well.
Change-Id: Ib460826e657ab3c9d36e7f2734c620f5afe43486
We encountered an issue in OpenDev where Zuul was unable to load
a queue item from zk because the object was missing (NoNodeError)
but we did not log which object it was, so further debugging was
impractical. To address this, log the znode path.
Change-Id: I25a4702adb65892c1d70db6bc560eb98a5cc205a
In the case that no auth info is provided, we were relying on an
exception handler to catch errors and leave the 'info' field in
redux set to null. We have such an exception handler for the
non-whitelabel case, but we did not have one for whitelabeled
tenants. Therefore, a whitelabel tenant with no auth info would
raise an exception which broke the app.
To correct this, add an exception handler to the whitelabel case
(configureAuthFromInfo). Additionally, avoid raising that exception
in the first place (so that we always set auth_params), and then
update the login button accordingly.
Change-Id: I8023bdb0db085de7e5abc664ab1b2f67c0744be4
We originally wrote the change list as a best-effort service for the
scheduler's check of whether a change is in a pipeline (which must be
fast and can't lock each of the pipelines to read in the full state).
To make it even simpler, we avoided sharding and instead
limited it to only the first 1024 changes. But scope creep happened,
and it now also serves to provide the list of relevant changes to the
change cache. If we have a pipeline with 1025 changes and delete
one of them from the cache, that tenant will break, so this needs to
be corrected.
This change uses sharding to correct it. Since it's possible to read
a sharded object mid-write, we retry reads that raise exceptions until
they succeed. In most cases the object should still fit in a single
znode, but we do truncate sharded znodes during writes, so there is a
chance of reading incorrect data even with a small number of changes;
the retry covers that case as well.
The scheduler no longer reads the state at the start of pipeline
processing (it never needed to anyway), so if the data becomes corrupt,
a scheduler will eventually be able to correct it. In other words,
the main pipeline processing path only writes this, and the other
paths only read it.
(An alternative solution would be to leave this as it was and instead
load the full pipeline state for maintaining the change cache; that
runs infrequently enough that we can accept the cost. This method is
chosen since it also makes other uses of this object more correct.)
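The read path then amounts to something like this sketch, where
read_shards stands in for the sharded ZooKeeper reader and the
exception set is illustrative:

    import json
    from kazoo.exceptions import NoNodeError

    def read_change_list(read_shards, path):
        while True:
            try:
                return json.loads(read_shards(path))
            except (NoNodeError, json.JSONDecodeError):
                # A concurrent write may have truncated or removed the
                # object; only the pipeline processor writes it, so a
                # later read will eventually return consistent data.
                continue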
Change-Id: I132d67149c065df7343cbd3aea69988f547498f4
When we process a pipeline, we iterate over every shared change
queue, then iterate over every item in each queue. When the last
item in a queue is complete and removed from an independent
pipeline, the queue itself is also removed. Since we iterate over
the list without making a copy first, we might then skip over the
next queue. We didn't notice this because any time we change a
pipeline, we process it again immediately afterwards. Nonetheless,
this could lead to starting jobs which we immediately cancel and
other inefficiencies.
What did make us take notice is that we accumulate the list of
changes in a pipeline and then use that to prune the change cache.
If we skip an item we would omit that change from the list of
relevant changes and delete it from the cache. The timing for this
has to be just right on two schedulers, but it happened in OpenDev.
To correct this, make a copy of the queue before iterating.
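In sketch form (the attribute and helper names are illustrative):

    def process_pipeline(pipeline, process_item):
        # Iterate over a copy of the queue list: removing an emptied
        # queue from pipeline.queues mid-loop must not skip the next
        # queue.
        for queue in list(pipeline.queues):
            for item in list(queue.queue):
                process_item(item)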
It is not feasible to test this reasonably because we immediately run
the pipeline again after changing it; there is no opportunity for the
test framework to interrupt processing between the two runs.
Change-Id: Iaf0bb998c069ad6e4f7c53ca5fb05bd9675410a0
It is possible for dependency update requests to fail due to errors
with the source. Previously when this happened, the change was ignored
by the pipeline and the user got no feedback. Chances are high that
reporting back to the source will fail so we can't really notify of this
error.
Instead, we retry the requests in the hope that the error is a one-off
and we can proceed with the originally requested job work.
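A hedged sketch of the retry loop (the attempt count and delay are
illustrative, not the values used here):

    import logging
    import time

    log = logging.getLogger(__name__)

    def update_dependencies_with_retry(update, change, attempts=3, delay=5):
        for attempt in range(1, attempts + 1):
            try:
                return update(change)
            except Exception:
                log.exception("Failed to update dependencies of %s "
                              "(attempt %d/%d)", change, attempt, attempts)
                if attempt == attempts:
                    raise
                time.sleep(delay)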
Story: 2009687
Change-Id: Id010d8c6809b9f9c012b81992590e54bf5e7e1d8
Paginate results when searching for builds or buildsets. Since we do not know how
many results a query may return, display "results X-Y of many" by
default. If the total number of results can be computed, display it.
Change-Id: If2985d6bcb194e3b074a860dab8cd5958b2577ec
This adds a variable which may be useful for debugging or auditing
the repo state of playbooks or roles for a job.
Change-Id: I86429a06ed8625faa72db6a19630de633f1694b6
This adds a nonvoting python38 tox unittest job that runs with
ZUUL_SCHEDULER_COUNT set to 2. We do this despite knowing the job will
probably fail so that we can start getting consistent feedback on
changes in pre-merge testing.
Change-Id: I08fac586a1a7140433f225988e490a1054cc69dd
Those tests are failing with multiple schedulers depending on which
scheduler completed the tenant reconfiguration first. As all assertions
are done on the objects from scheduler-0, they will fail if scheduler-1
completed the tenant reconfiguration first.
To make those tests work with multiple schedulers, we might want to wait
until all schedulers have completed their reconfiguration before doing
the assertions.
There are also a few test cases that don't rely directly on the results
of a tenant reconfiguration, but still use local tenant objects for
assertions. I assume that those tests are failing for the same reason -
because the tenant object used for the assertion is not up-to-date.
Change-Id: I4df816ec98f5fbab25cd412a5146f0f85d6d0138
In a multi scheduler setup, the gitwatcher sometimes fails when
iterating over the list of projects:
Traceback (most recent call last):
File "/tmp/zuul/zuul/driver/git/gitwatcher.py", line 155, in run
self.watcher_election.run(self._run)
File "/tmp/zuul/zuul/zk/election.py", line 28, in run
return super().run(func, *args, **kwargs)
File "/tmp/zuul/.tox/py37/lib/python3.7/site-packages/kazoo/recipe/election.py", line 54, in run
func(*args, **kwargs)
File "/tmp/zuul/zuul/driver/git/gitwatcher.py", line 146, in _run
self._poll()
File "/tmp/zuul/zuul/driver/git/gitwatcher.py", line 124, in _poll
for project in self.connection.projects:
RuntimeError: dictionary changed size during iteration
To fix this, we could simply iterate over a copy of the original list.
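A minimal sketch of that fix (poll_one stands in for the per-project
poll logic):

    def poll_projects(connection, poll_one):
        # Snapshot the project list so a project added by another
        # thread mid-iteration cannot raise "dictionary changed size
        # during iteration".
        for project in list(connection.projects):
            poll_one(project)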
Change-Id: Ie23db5715f48d69293f7b2656ff1c377ec54e878
These tests are storing the reported index on the fake elasticsearch
backend which is a different instance for each scheduler. Thus,
depending on which scheduler reports the item, the assertions in
these tests might pass or fail.
Change-Id: Ia5206d9c6097392ceff1abe6a911aedab22f2709
Those tests are also using the fake GitHub implementation, which means
that every scheduler gets a different fake GitHub instance. Thus,
assertions might fail depending on which scheduler did the interaction
with GitHub.
Change-Id: I4234a58fe15ce4e1cfbeb935fd18c635db0d8a9b
Currently, the GitConnection doesn't provide a 'source' attribute. So
far this hasn't been a problem, as the source attribute was only
accessed within the same connection/driver.
With the ChangeCache we are now accessing the source attribute outside
of the specific connection/driver combination, which fails in the case
of a GitConnection with the following exception:
Traceback (most recent call last):
File "/workspace/zuul/zk/vendor/watchers.py", line 177, in _log_func_exception
result = self._func(data, stat, event)
File "/workspace/zuul/zk/change_cache.py", line 201, in _cacheItemWatcher
self._get(key, data_uuid, zstat)
File "/workspace/zuul/zk/change_cache.py", line 306, in _get
change = self._changeFromData(data)
File "/workspace/zuul/zk/change_cache.py", line 410, in _changeFromData
project = self.connection.source.getProject(change_data["project"])
AttributeError: 'GitConnection' object has no attribute 'source'
This seems to be only a problem in a multi scheduler setup when the
ChangeCache tries to load a change from ZooKeeper.
To fix this, we simply add the source attribute to the GitConnection.
This uses the same approach as other connections, by calling
driver.getSource().
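In sketch form (the real class derives from BaseConnection and has a
fuller constructor; this only shows the new attribute):

    class GitConnection:
        def __init__(self, driver, connection_name, connection_config):
            self.driver = driver
            self.connection_name = connection_name
            self.connection_config = connection_config
            # Expose the driver's source, as other connections do, so
            # the ChangeCache can call connection.source.getProject().
            self.source = driver.getSource(self)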
Change-Id: I9503bf117f7fab31038e11eef9db65205d55e2e5