We currently emit build stats only for success and failure, not
for NODE_FAILURE, CANCELED, SKIPPED, etc. To round out the feature
so that all build results are emitted, this includes a small
refactor to report build stats at the same time we report build
results to the database (in almost all cases). One exception is
when we receive a non-current build result -- that generally
happens after a build is canceled, so we don't need to emit the
stats again.
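Purely as a sketch of the shape of the refactor (the method and
attribute names here are assumptions, not the actual Zuul API):

    def reportBuildEnd(self, build, current=True):
        # Record the result in the database for every result type,
        # including NODE_FAILURE, CANCELED, SKIPPED, etc.
        self.sql.reportBuildEnd(build)
        if current:
            # A non-current result generally arrives after a build
            # was canceled, and its stats were already emitted then.
            self.emitStats(build)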
Change-Id: I3bdf4e2643e151e2eae9f204f62cdc106df876b4
This change adds a warning (viewable in the web UI) when multiple
configurations are found for a project-branch combination.
Change-Id: I3defb2b1e6d7b59c450361c11d709802fe193373
When a build pauses, it can also return a list of child jobs to execute.
If the paused build was deduplicated, we need to call `setResult()` on
all items that have that particular build.
Change-Id: Iead5c02032bccf46852ee6b2c8adf714689aa2f5
These tests would fail without the safety check introduced in the
parent change.
They were originally written under the assumption that we would
have the optimal behavior for the pipeline processor, where some
builds may be canceled but others may proceed after certain updates
to circular dependency bundles. However, fixing that is proving
impractical without a refactor, so these tests are added in an
updated form that mostly asserts that many jobs are aborted.
The original forms of the tests are also added in this change, but
with skip decorators attached. It is hoped that after some
refactoring of circular dependency handling, we can use the original
form of these tests to validate the desired behavior.
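As a sketch, the skipped variants follow the standard unittest
pattern (the test name here is hypothetical; the real tests may use a
project-specific decorator):

    import unittest

    class TestCircularDependencies(unittest.TestCase):
        @unittest.skip("Requires a refactor of circular dependency "
                       "handling to restore the optimal behavior")
        def test_optimal_bundle_update(self):
            ...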
Change-Id: Ie45da9fc1848a717bf5308e595edd27e598d6882
GraphQL queries (I77be4f16cf7eb5c8035ce0312f792f4e8d4c3e10) require
authentication. Enqueueing changes from GitHub (including Depends-On)
requires that we run a GraphQL query. This means that Zuul must be
able to authenticate either via an application or an api_token to
support features like Depends-On.
If the app is set up (app_id in config) but we aren't installed with
permissions on the project we're looking up, then fall back to using
a specified api_token. This will make Depends-On work.
Logging is updated to reflect whether or not we are able to fall back
to the api_token if the application is not installed. We log the lack
of an application installation at info level if we can fall back to
the token, and log at error level if we're falling back to anonymous
access.
For backward compatibility we continue to fall back to anonymous
access if neither an application install nor an api_token is present.
The reason for this is that features like a job's required-projects
work fine anonymously, and there may be Zuul installations that don't
need the additional functionality.
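The fallback order described above can be sketched as follows (the
method names are illustrative assumptions, not the actual driver API):

    def getGithubClient(self, project):
        if self.app_id and self.hasInstallation(project):
            return self.getAppClient(project)  # preferred: app auth
        if self.api_token:
            self.log.info("No app installed for %s; "
                          "falling back to API token", project)
            return self.getTokenClient()
        self.log.error("No app installed for %s and no api_token; "
                       "falling back to anonymous access", project)
        return self.getAnonymousClient()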
Keep in mind that authenticated requests to GitHub get larger API rate
limits. Zuul installations should consider setting an API token even
when using an application for this reason. This gives Zuul the best
chance that fallback requests will not be rate limited.
Documentation is updated, a changelog is added, and several test
configuration files are updated with the required info.
Story: #2008940
Change-Id: I2107aeafc55591eea790244701567569fa6e80d4
We currently iterate over every job/host/etc variable in the freeze
playbook. The reason is that if any value in any variable is
Undefined according to Jinja, the Ansible combine filter throws
an error. What we want to do in Zuul is merge every variable we can,
but if any is undefined, we skip it. Hence the process of combining
the variables one at a time in a task and ignoring errors.
This process can be slow, especially if we start with a large
amount of data in one of the early variables. The combine filter
needs to reprocess the large data repeatedly for each additional
variable.
To improve the process, we create a new action plugin, "zuul_freeze"
which takes a list of variables we want to freeze, then templates
them one at a time and stores the result in a cacheable fact. This
is the essence of what we were trying to accomplish with the combine
filter.
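The essence of the plugin might look like this sketch (the fact name
and argument handling are assumptions; the plugin API calls are
standard Ansible):

    from ansible.errors import AnsibleUndefinedVariable
    from ansible.plugins.action import ActionBase

    class ActionModule(ActionBase):
        def run(self, tmp=None, task_vars=None):
            result = super(ActionModule, self).run(tmp, task_vars)
            frozen = {}
            for name in self._task.args.get('variables', []):
                try:
                    # Template each variable individually so a single
                    # undefined value doesn't abort the whole merge.
                    frozen[name] = self._templar.template(
                        task_vars.get(name))
                except AnsibleUndefinedVariable:
                    continue
            result['ansible_facts'] = {'_zuul_frozen': frozen}
            result['_ansible_facts_cacheable'] = True
            return result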
Change-Id: Ie41f404762daa1b1a5ae47f6ec1aa1954ad36a39
Several recent bugs and attempted fixes have shown that there may
be some edge cases in the handling of dependency cycles that have
the potential to cause jobs to run with the wrong changes in place.
While we work on longer-term fixes to those, add a safety check to
the pipeline processor so that if we detect a change to the bundle
contents of a queue item, we remove the item from the queue. We
may not necessarily perform the optimal behavior with this, but it
should keep us from running jobs with known incorrect changes.
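Schematically (the attribute and helper names here are illustrative
assumptions):

    # During pipeline processing, compare the queue item's recorded
    # bundle contents against a fresh computation; dequeue on change.
    old_changes = set(item.bundle.changes)
    new_changes = set(computeBundleChanges(item))
    if old_changes != new_changes:
        self.removeItem(item)  # safer than running stale jobs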
This change requires some minor adjustment to some existing unit
tests (it doesn't significantly change the outcome, but it does
cause some jobs to be aborted sooner). A followup change will add
some more tests which would fail without this change but merit
separate review.
Change-Id: Ia7b1d5b7e3d6910a709478082929f96364ca996b
When updating topic dependencies we only consider the current patchset.
However, when there are changes that have a git-level dependency on
an outdated patchset and those dependencies are in a different topic,
we can run into maximum recursion depth issues.
To fix this problem, we will keep a history of topics that we've already
processed and not call `getChangesByTopic()` again when we've already
seen a topic.
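A sketch of the guard (names assumed):

    def addTopicDependencies(self, change, history=None):
        if history is None:
            history = set()
        if change.topic and change.topic not in history:
            history.add(change.topic)
            for dep in self.getChangesByTopic(change.topic):
                # Recurse with the shared history so a topic that has
                # already been seen is never queried again.
                self.addTopicDependencies(dep, history)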
Change-Id: I6c15f502cfd593a44d7adda930670692151b6713
If a second change is enqueued in check significantly after the
first, then node requests for child jobs may not be deduplicated
correctly.
Before deduplicating a build, Zuul applies parent data to child
jobs, then compares the child jobs to determine if they are
equivalent. If so, then they are deduplicated.
This only happens one level in the hierarchy at a time. Consider
the case where a change is enqueued and both the parent and child
jobs have completed (but the change is still in the queue waiting
on a third, unrelated, job).
If the second change in the bundle is enqueued, Zuul will:
1) Attempt to apply parent data to child jobs.
Since no jobs have completed yet for this item, no parent data
are applied.
2) Deduplicate jobs in the second change.
Zuul will deduplicate the parent job at this point.
3) Zuul will compare the child jobs in the two changes and determine
they are different because one has parent data and the other does
not.
4) Zuul submits a node request for the child job.
5) On the next pipeline process, Zuul applies the parent data from
the deduplicated parent job to the new child job.
6) Zuul deduplicates the child job, and the nodepool request is
orphaned.
To correct this, we will repeat the process of applying parent data
to child jobs each time we find at least one build to deduplicate.
That means that all existing parent data will be applied to all jobs
on each pass through the pipeline processor no matter how deep the
dependency hierarchy.
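In outline (helper names assumed):

    # Repeat until a pass makes no new deduplications, so parent data
    # propagates through arbitrarily deep job hierarchies.
    while True:
        applyParentData(item)
        if not deduplicateJobs(item):
            break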
Change-Id: Ifff17df40f0d59447f74cdde619246171279b553
When a deduplicated job paused, it would not wait for all children
across all queue items to complete before resuming; instead it
would wait only for the children in its own queue item.
Check all queue items a build is in before resuming it.
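Conceptually (names assumed):

    def canResume(self, build):
        # A paused build may appear in several queue items; only
        # resume once the children in every one of them are done.
        for item in self.itemsWithBuild(build):
            for child_job in item.getChildJobs(build.job):
                child = item.getBuild(child_job)
                if child is None or not child.result:
                    return False
        return True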
Change-Id: Ic2dec3a6dc58230b0873d7e8ba474bc39ed28385
There is enough difference between how jobs are deduplicated in
check vs gate that we should test both code paths. This duplicates
the remaining tests.
Change-Id: I539a61a575b036021fd46b363a4f7f09262db3a7
When we deduplicate jobs, we intend to call setResult on all of
the queue items with the deduplicated build. This only worked
in dependent pipelines because we only looked for queue items in
the current bundle. In independent pipelines, the queue items
can be in different bundles.
To resolve this, search for items with deduplicated builds across
the whole queue in independent pipelines (using the approach we use
when deduplicating them to begin with).
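A sketch of the search (attribute names assumed):

    if pipeline.manager.isIndependent():       # hypothetical check
        candidates = pipeline.getAllItems()    # whole queue
    else:
        candidates = item.bundle.items         # current bundle only
    for other in candidates:
        if build in other.current_build_set.getBuilds():
            other.setResult(build)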
Change-Id: I16436710c47b4f22df39e0cd82d0e289b2293c32
To protect Zuul servers from accidental DoS attacks in case someone,
say, uploads a 1k change tree to Gerrit, add an option to limit the
dependency processing in the Gerrit driver and in Zuul itself (since
those are the two places where we recursively process deps).
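The limit might be enforced roughly like this (the option and
exception names are assumptions):

    def followDependencies(self, change, seen=None, max_deps=None):
        seen = set() if seen is None else seen
        if max_deps is not None and len(seen) >= max_deps:
            raise DependencyLimitExceeded(
                "Too many dependencies while processing %s" % change)
        seen.add(change)
        for dep in self.getDependencies(change):
            if dep not in seen:
                self.followDependencies(dep, seen, max_deps)
        return seen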
Change-Id: I568bd80bbc75284a8e63c2e414c5ac940fc1429a
The _reportNonEqueuedItem() method is used to temporarily enqueue a
change, report it, and directly dequeue it. However, when the
reporting fails, e.g. due to a config error, the item will never be
dequeued.
This results in a leaked change that causes the queue processor to
loop over it indefinitely.
In our case the config error was caused by disabling the branch
protection in GitHub for a release branch in a certain repository. This
branch also defined a project-template which could not be found by Zuul
anymore after the branch protection was disabled [1].
This behaviour can be reproduced in a unit test by enforcing a broken
tenant configuration that references a non-existing project template
during a pipeline run with a circular dependency.
To fix this, ensure that the temporarily enqueued item in
_reportNonEqueuedItem() will be dequeued in any case.
Although this fixes the endless loop in the queue processor, the same
exception will still be raised on pipeline level ("exception processing
pipeline...").
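The shape of the fix is a try/finally around the temporary item
(simplified; the real signatures differ):

    def _reportNonEqueuedItem(self, change_queue, change, event):
        item = change_queue.enqueueChange(change, event)
        try:
            # Reporting may raise, e.g. TemplateNotFoundError from a
            # broken in-repo configuration.
            self.reportItem(item)
        finally:
            # Always dequeue, so a failed report can no longer leak
            # the change into the queue processor.
            change_queue.dequeueItem(item)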
[1]:
2023-08-28 15:28:53,507 ERROR zuul.Pipeline.example-tenant.check: [e: 06d1ab80-45b7-11ee-8c99-721bf9f22e8c] Unable to re-enqueue change <Change 0x7f066ba36090 example-tenant/project 1234,80b4068eb1fe485df59185f0c93059fe7b15c23e> which is missing dependencies
Traceback (most recent call last):
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 1614, in _processOneItem
    quiet_dequeue = self.addChange(
                    ^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 583, in addChange
    if not self.enqueueChangesAhead(change, event, quiet,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/independent.py", line 73, in enqueueChangesAhead
    r = self.addChange(needed_change, event, quiet=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 583, in addChange
    if not self.enqueueChangesAhead(change, event, quiet,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/independent.py", line 73, in enqueueChangesAhead
    r = self.addChange(needed_change, event, quiet=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 583, in addChange
    if not self.enqueueChangesAhead(change, event, quiet,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/independent.py", line 73, in enqueueChangesAhead
    r = self.addChange(needed_change, event, quiet=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 583, in addChange
    if not self.enqueueChangesAhead(change, event, quiet,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/independent.py", line 73, in enqueueChangesAhead
    r = self.addChange(needed_change, event, quiet=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 583, in addChange
    if not self.enqueueChangesAhead(change, event, quiet,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/independent.py", line 73, in enqueueChangesAhead
    r = self.addChange(needed_change, event, quiet=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 583, in addChange
    if not self.enqueueChangesAhead(change, event, quiet,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/independent.py", line 73, in enqueueChangesAhead
    r = self.addChange(needed_change, event, quiet=True,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 611, in addChange
    self._reportNonEqueuedItem(change_queue,
  File "/opt/zuul/lib/python3.11/site-packages/zuul/manager/__init__.py", line 676, in _reportNonEqueuedItem
    if self.pipeline.tenant.layout.getProjectPipelineConfig(ci):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/model.py", line 8194, in getProjectPipelineConfig
    templates = self.getProjectTemplates(template_name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zuul/lib/python3.11/site-packages/zuul/model.py", line 8127, in getProjectTemplates
    raise TemplateNotFoundError("Project template %s not found" % name)
zuul.model.TemplateNotFoundError: Project template template-foo not found
Change-Id: I2514b783b646caae2863ee1ccbac4600772fe4d6
The Gerrit HEAD API endpoint returns '/refs/heads/master', not the
'master' that the test fixture was constructed with. Correct this.
Change-Id: I98b0759516bd50c0eddeb9245fc951c58e80ee45
If a configuration error existed for a project on one branch
and a change was proposed to update the config on another branch,
that would activate a code path in the manager which attempts to
determine whether errors are relevant. An error (or warning) is
relevant if it is not in a parent change, and is on the same
project+branch as the current patch. This is pretty generous.
This meant that if a patch touched Zuul configuration on a branch
with a warning, all warnings on that branch had to be fixed. This
was not the intended behavior.
To correct that, we no longer consider warnings in any of the
places where we check whether a queue item is failing due to
configuration errors.
An existing test is updated to include sufficient setup to trigger
the case where a new valid configuration is added to a project
with existing errors and warnings.
A new test case is added to show that we can add new deprecations
as well, and that they are reported to users as warnings.
Change-Id: Id901a540fce7be6fedae668390418aca06a950af
Prior to this change we checked whether there were any errors in
the config (which includes warnings by default) and returned a build
error if there were. Now we only return an error when actual errors
(not just warnings) are present.
This allows users to push config updates that don't fix all warnings
immediately. Without this any project with warnings present would need
to fix all warnings before newly proposed configs can take effect. This
is particularly problematic for speculative testing, but in general it
seems like warnings shouldn't be fatal.
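In essence (the attribute names are assumed):

    # Only severity "error" blocks the config from taking effect;
    # warnings are still reported but are no longer fatal.
    fatal = [e for e in config_errors if e.severity == 'error']
    if fatal:
        return buildError(fatal)  # hypothetical reporting helper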
Change-Id: I31b094fb366328696708b019354b843c4b94ffc0
Also add the configured window size to the pipeline stats.
Remove the ambiguous phrasing "since Zuul started" from some of
the counter documentation.
Change-Id: Icbb7bcfbf25a1e34d26dd865fa29f61faceb4683
This allows users to set a maximum value for the active window
in the event they have a project that has long stretches of
passing tests but they still don't want to commit too many resources
in case of a failure.
We should all be so lucky.
Change-Id: I52b5f3a9e7262b88fb16afc4520b35854e8df184
This method has gotten somewhat complicated, so add a unit test
that verifies that deduplication works and also verify that two
secrets with the same name but different contents are not
deduplicated.
Change-Id: I17ced6eea3c867a1119251631113550581c40bf4
This adds a configuration warning (viewable in the web UI) for any
regular expressions found in in-repo configuration that cannot
be compiled and executed with RE2.
Change-Id: I092b47e9b43e9548cafdcb65d5d21712fc6cc3af
The re2 library does not support negative lookahead expressions.
Expressions such as "(?!stable/)", "(?!master)", and "(?!refs/)" are
very useful branch specifiers with likely many instances in the wild.
We need to provide a migration path for these.
This updates the configuration options which currently accept Python
regular expressions to additionally accept a nested dictionary which
allows specifying that the regex should be negated. In the future,
other options (global, multiline, etc) could be added.
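Evaluation of the nested form could look like this sketch (assuming
'regex' and 'negate' as the dictionary keys; the helper itself is
illustrative):

    import re  # or an RE2 binding once the transition completes

    def matches(spec, value):
        # spec is either a plain pattern string or a dict such as
        # {'regex': 'stable/.*', 'negate': True}
        if isinstance(spec, dict):
            pattern = spec['regex']
            negate = spec.get('negate', False)
        else:
            pattern, negate = spec, False
        result = re.search(pattern, value) is not None
        return not result if negate else result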
A very few options are currently already compiled with re2. These are
left alone for now, but once the transition to re2 is complete, they
can be upgraded to use this syntax as well.
Change-Id: I509c9821993e1886cef1708ddee6d62d1a160bb0
The current "layout_lock" in the scheduler is really an "abide" lock.
We lock it every time we change something in the abide (including
tenant layouts). The name is inherited from pre-multi-tenant Zuul.
This can cause some less-than-optimal behavior when we need to wait to
acquire the "layout_lock" for a tenant reconfiguration event in one
thread while another thread holds the same lock because it is
reloading the configuration for a different tenant. Ideally we should
be able to have finer-grained tenant-level locking instead, allowing
for less time waiting to reconfigure.
The following sections describe the layout lock use prior to this
commit and how this commit adjusts the code to make it safe for
finer-grained locking.
1) Tenant iteration
The layout lock is used in some places (notably some cleanup methods)
to avoid having the tenant list change during the method. However,
the configloader already performs an atomic replacement of the tenant
list making it safe for iteration. This change adds a lock around
updates to the tenant list to prevent corruption if two threads update
it at the same time.
The semaphore cleanup method indirectly references the abide and
layout for use in global and local semaphores. This is just for path
construction, and the semaphores exist apart from the abide and layout
configurations and so should not be affected by either changing while
the cleanup method is running.
The node request cleanup method could end up running with outdated
layout objects, including pipelines; however, it should not be a
problem if these orphaned objects end up refreshing data from ZK
right before they are removed.
In these cases, we can simply remove the layout lock.
2) Protecting the unparsed project branch cache
The config cache cleanup method uses the unparsed project branch cache
(that is, the in-memory cache of the contents of zuul config files) to
determine what the active projects are.
Within the configloader, the cache is updated and then also used while
loading tenant configuration. The layout lock would have made sure
all of these actions were mutually exclusive. In order to remove the
layout lock here, we need to make sure the Abide's
unparsed_project_branch_cache is safe for concurrent updates.
The unparsed_project_branch_cache attribute is a dictionary that
contains references to UnparsedBranchCache objects. Previously, the
configloader would delete each UnparsedBranchCache object from the
dictionary, reinitialize it, then incrementally add to it.
This current process has a benign flaw. The branch cache is cleared,
and then loaded with data based on the tenant project config (TPC)
currently being processed. Because the cache is loaded based on data
from the TPC, it is really only valid for one tenant at a time despite
our intention that it be valid for the entire abide. However, since
we do check whether it is valid for a given TPC, and then clear and
reload it if it is not, there is no error in data, merely an
incomplete utilization of the cache.
In order to make the cache safe for use by different tenants at the
same time, we address this problem (and in doing so also make the
cache *effective* for different tenants, even at different times).
The cache is updated to store the ltime for each entry in the cache,
and also to store null entries (with ltimes) for files and paths that
have been checked but are not present in the project-cache. This
means that at any given time we can determine whether the cache is
valid for a given TPC, and support multiple TPCs (i.e., multiple
tenants).
It's okay for the cache to be updated simultaneously by two tenants
since we don't allow the cache contents to go backwards in ltime. The
cache will either have the data with at least the ltime required, or
if not, that particular tenant load will spawn cat jobs and update it.
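A sketch of the resulting cache semantics (the structure and names
are assumed):

    class UnparsedBranchCache:
        def __init__(self):
            # path -> (ltime, data); data is None for paths that were
            # checked but hold no config file.
            self.entries = {}

        def put(self, path, ltime, data):
            old = self.entries.get(path)
            if old is None or old[0] < ltime:
                # Never let the cache go backwards in ltime.
                self.entries[path] = (ltime, data)

        def isValidFor(self, tpc, min_ltime):
            # Valid for this TPC if every relevant path has an entry
            # (possibly a null entry) at or after min_ltime.
            return all(
                p in self.entries and self.entries[p][0] >= min_ltime
                for p in relevantPaths(tpc))  # hypothetical helper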
3) Protecting Tenant Project Configs (TPCs)
The loadTPC method on the ConfigLoader would similarly clear the TPCs
for a tenant, then add them back. This could be problematic for any
other thread which might be referencing or iterating over TPCs. To
correct this, we take a similar approach of atomic replacement.
Because there are two flavors of TPCs (config and untrusted) and they
are stored in two separate dictionaries, in order to atomically update
a complete tenant at once, the storage hierarchy is restructured as
"tenant -> {config/untrusted} -> project" rather than
"{config/untrusted} -> tenant -> project". A new class named
TenantTPCRegistry holds both flavors of TPCs for a given tenant, and
it is this object that is atomically replaced.
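The replacement is then a single reference swap (sketch; the method
names are assumed):

    new_registry = TenantTPCRegistry()
    for tpc in config_tpcs:
        new_registry.addConfigTPC(tpc)
    for tpc in untrusted_tpcs:
        new_registry.addUntrustedTPC(tpc)
    # One atomic assignment: readers see either the old or the new
    # registry, never a half-updated one.
    abide.tpc_registry[tenant.name] = new_registry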
Now that these issues are dealt with, we can implement a tenant-level
thread lock that is used simply to ensure that two threads don't
update the configuration for the same tenant at the same time.
The scheduler's unparsed abide is updated in two places: upon full
reconfiguration, or when another scheduler has performed a full
reconfiguration and updated the copy in ZK. To prevent these two
methods from performing the same update simultaneously, we add an
"unparsed_abide_lock" and mutually exclude them.
Change-Id: Ifba261b206db85611c16bab6157f8d1f4349535d
This extends the previous change to include project default branch
support for the Gerrit driver as well as GitHub.
Change-Id: I2b1f6feed72277f5e61a2789d8af5276ee4c7b05
This supplies a per-project default value for Zuul's default-branch
based on what the default branch is set to in GitHub. This means
that if users omit the default-branch setting on a Zuul project
stanza, Zuul will automatically use the correct value.
If the value in GitHub is changed, an event is emitted which allows
us to automatically reconfigure the tenant.
This could be expanded to other drivers that support an indication
of which branch is default.
Change-Id: I660376ecb3f382785d3bf96459384cfafef200c9
We subclass the yaml.SafeDumper class to adjust its behavior with an
override of the ignore_aliases method. It is possible to subclass
the cyaml.CSafeDumper class as well. The "C" part is actually the
Parser and Emitter, not the Dumper/Representer, so our override
is still effective whether we use the C or Python versions.
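For reference, the pattern is (the class name is illustrative; the
PyYAML calls are standard):

    import yaml

    try:
        BaseDumper = yaml.CSafeDumper  # present with libyaml
    except AttributeError:
        BaseDumper = yaml.SafeDumper

    class AnsibleSafeDumper(BaseDumper):
        def ignore_aliases(self, data):
            # Always inline data instead of emitting anchors/aliases.
            return True

    serialized = yaml.dump({'msg': 'hello'}, Dumper=AnsibleSafeDumper)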
This can produce a significant performance increase when exchanging
large amounts of data with Ansible.
The C emitter is more aggressive about not using unnecessary quotes,
so the ansible dumper test assertions need to change. To add some
extra assurance, that test is also updated to check that the
round-trip load is as expected.
Change-Id: I30fd82c0b9472120d010f3f4a65e17fb426b0f7e
This allows users to trigger the new early failure detection by
matching regexes in the streaming job output.
For example, if a unit test job outputs something sufficiently
unique on failure, one could write a regex that matches that and
triggers the early failure detection before the playbook completes.
For hour-long unit test jobs, this could save a considerable amount
of time.
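For example (the function and patterns are hypothetical; re2 here is
the google-re2 binding this change adds to the Ansible venvs):

    import re2

    def failed_early(lines, patterns):
        # Scan streamed console output; any match triggers early
        # failure detection before the playbook completes.
        compiled = [re2.compile(p) for p in patterns]
        return any(r.search(line) for line in lines for r in compiled)

    output = ["collected 100 items", "FAILED (failures=2)"]
    if failed_early(output, [r'FAILED \(failures=\d+\)']):
        print("early failure detected")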
Note that this adds the google-re2 library to the Ansible venvs. It
has manylinux wheels available, so is easy to install with
zuul-manage-ansible. In Zuul itself, we use the fb-re2 library which
requires compilation and is therefore more difficult to use with
zuul-manage-ansible. Presumably using fb-re2 to validate the syntax
and then later actually using google-re2 to run the regexes is
sufficient. We may want to switch Zuul to use google-re2 later for
consistency.
Change-Id: Ifc9454767385de4c96e6da6d6f41bcb936aa24cd
Combining stdout/stderr in the result can lead to problems when e.g.
the stdout of a task is used as an input for another task.
This is also different from the normal Ansible behavior and can be
surprising and hard to debug for users.
The new behavior is configurable and off by default to retain backward
compatibility.
Change-Id: Icaced970650913f9632a8db75a5970a38d3a6bc4
Co-Authored-By: James E. Blair <jim@acmegating.com>
In case of a retry there might be no logs available to help the user
understand the reason for a failure. To improve this we can include
the details of the failure as part of the build result.
Change-Id: Ib9fdbdec5d783a347d1b6e5ce6510d50acfe1286