Each scheduler in tests needs its own connection registry.
This change only affects tests.
Change-Id: I2ad188cf5a72c46f7486c23ab653ff574123308b
Story: 2007192
1.
gerritconnection: change events use "data.change.branch" to fill
event.branch, which does not have the 'refs/heads/' prefix.
Branch creation and deletion events were using "data.refUpdate.refName"
to fill event.branch, which holds the complete refname starting
with 'refs/heads/'.
As a result, in the scheduler's event process queue, the cache was not
properly cleared when calling Abide's clearUnparsedBranchCache, which is
called with event.branch rather than change.branch (change.branch is
already correctly set to the relative refname). When reconfiguring a
tenant, it could therefore hold on to data which had been removed, and
the reconfigured tenant could still use data that had disappeared.
This patch removes the "refs/heads/" prefix when setting the event
branch name in case of branch creation or deletion.
Note: it is not possible to use event.branch to set change.branch in
gerritconnection.getChange, as event objects don't have a constant base
type. An event can be either:
- GerritTriggerEvent -> TriggerEvent -> object
- DequeueEvent -> ManagementEvent -> object
The latter doesn't have a branch attribute.
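For point 1, the fix amounts to stripping the prefix before assigning
event.branch; a minimal sketch (the helper name is illustrative, not the
actual connection code):

    def _strip_ref_prefix(refname):
        # Normalize a full refname such as 'refs/heads/master' to the
        # short branch name 'master'; leave other refnames untouched.
        prefix = 'refs/heads/'
        if refname.startswith(prefix):
            return refname[len(prefix):]
        return refname

    # On branch creation/deletion the event branch is now filled from the
    # normalized refname instead of the raw data.refUpdate.refName value:
    # event.branch = _strip_ref_prefix(refname)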
2.
Setting event.branch is moved so that it applies to any 'ref-update'
event; it will now also be set when newrev and oldrev are both not null.
This does not, and should not, have any impact.
Change-Id: Ie60b382b23074cc9feff0648e786ddaf0d3454aa
hasUnparsedBranchCache expects the project's canonical name as its
first argument; calling it with anything else could lead to the cache
not being properly cleaned.
As a result, two tests needed fixes:
1. A github test was no longer working. In the current test class, the
tenant is not configured to exclude unprotected branches either globally
or per project, so a reconfiguration is expected.
A copy of this test is added in TestGithubUnprotectedBranches to use a project
which has exclude-unprotected-branches enabled.
Various fixes are also included:
- getPushEvent: no mutable default argument [1]
- branch_updated only when required
- change test structure of similar tests to send the proper sha1 when deleting
the branch
2. An exception appeared in test_pending_merge_in_reconfig:
Traceback (most recent call last):
  File "/home/safranaerosystemsfcs/zuul/zuul/zuul/scheduler.py", line 1347, in process_management_queue
    self._doTenantReconfigureEvent(event)
  File "/home/safranaerosystemsfcs/zuul/zuul/zuul/scheduler.py", line 911, in _doTenantReconfigureEvent
    project.canonical_name, branch)
AttributeError: 'NoneType' object has no attribute 'canonical_name'
After setting the project, the test was blocked during tenant
reconfiguration, which needs a merger in order to complete.
This change reactivates the merger before starting a reconfiguration.
Additional checks make sure the initial merge job did not complete.
[1] https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
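For reference, the gotcha described in [1] boils down to the following
(getPushEvent's real signature may differ; the parameter shown here is
illustrative):

    def getPushEvent(added_files=[]):          # buggy: the default list is
        added_files.append('zuul.yaml')        # shared across every call
        return added_files

    def getPushEventFixed(added_files=None):   # fixed: fresh list per call
        if added_files is None:
            added_files = []
        added_files.append('zuul.yaml')
        return added_files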
Change-Id: I1e0948944f71a74411e78e3b3e6b7eaf2ed36b63
After a dequeue we currently release all jobs immediately. In rare
cases this can lead to the job succeeding before it has been canceled
by the scheduler. To fix this, wait until settled between the dequeue
and the build release.
Change-Id: Ic788daa209a1ef7efdd58185935a5ebff2e5fb39
If a build must be retried, the previous build information gets lost.
To make it possible to tell that a build was retried, the retried builds
are now part of the MQTT message.
Change-Id: I8c93376f844c3d1c55c89a250384a7f835763677
Depends-On: https://review.opendev.org/704983
Replace `self.executor_client` with `scheduler.executor`.
This change only touches tests.
Change-Id: I22b01f2eff881e18633e5bab1ec390f3b5367a4d
Story: 2007192
To improve consistency, remove `self.sched` and
use `self.scheds.first.sched` in all tests.
This change only touches tests.
Change-Id: I0a8c3f8ad634d1d88f3c68a11e4d5587993f4a0d
Story: 2007192
As a preparation for scale-out-scheduler, the scheduler in tests
was extracted in a previous change in order to allow starting
multiple instances.
This change continues on by introducing a manager to create
additional scheduler instances and the ability to call certain
methods on some or all of those instances.
This change only touches tests.
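A rough sketch of what such a manager could look like (the names below
are illustrative, not the actual test API):

    class SchedulerManager:
        """Holds multiple scheduler instances in tests."""

        def __init__(self):
            self.instances = []

        def register(self, sched):
            # Track an additional scheduler instance.
            self.instances.append(sched)
            return sched

        @property
        def first(self):
            return self.instances[0]

        def execute(self, method_name, *args, selector=lambda s: True):
            # Call the given method on all (or a subset of) schedulers.
            return [getattr(s, method_name)(*args)
                    for s in self.instances if selector(s)]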
Change-Id: Ia05a7221f19bad97de1176239c075b8fc9dab7e5
Story: 2007192
Setup config will no longer set the base test config object but rather
return a new one. This allows setting up multiple config objects, which
is needed in order to instantiate multiple schedulers and let them have
different configs, e.g. for the command socket.
Change-Id: Icc7ccc82f7ca766b0b56c38e706e6e3215342efa
Story: 2007192
During reconfiguration, jobs may be re-frozen, in which case they
will have the "queued" flag reset to False. This is normally only
set to true at the time that a nodepool request is submitted for
the job. If a reconfiguration happens after a request is submitted,
this flag will be reset to false and never set to true again. This
leads to misleading information in the UI.
Set it to true whenever we confirm there is an outstanding node
request. This should happen immediately after the freeze which
currently clears the flag, so the window for incorrect data should
be very small.
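A rough sketch of the described adjustment (the attribute and lookup
names are assumptions, not the actual scheduler code):

    def mark_queued_jobs(item):
        # After jobs are re-frozen during reconfiguration, restore the
        # 'queued' flag for every job that still has an outstanding
        # node request.
        for job_name, request in item.current_build_set.node_requests.items():
            job = item.getJob(job_name)
            if job is not None and request is not None:
                job.queued = True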
Change-Id: I2233b7990c0011c4a875e71eded4c31276926154
Operators can use the "{tenant.name}" special word when setting conditions'
values. This special word will be replaced at evaluation time by the
name of the tenant for which the authorization check is being done.
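A minimal sketch of how such a substitution could be applied at
evaluation time (not the actual implementation):

    def resolve_condition_value(value, tenant_name):
        # Replace the special word "{tenant.name}" with the name of the
        # tenant the authorization check is evaluated for.
        if isinstance(value, str):
            return value.replace('{tenant.name}', tenant_name)
        return value

    # e.g. a condition value of 'zuul-admin-{tenant.name}' becomes
    # 'zuul-admin-mytenant' when checking access to the 'mytenant' tenant.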
Change-Id: I6f1cf14ad29e775d9090e54b4a633384eef61085
As a preparation for scale-out-scheduler the scheduler in tests
needs to be extracted in order to start multiple instances.
For brevity this change only extracts the instantiation of the
scheduler in the test class with no side changes on existing
tests. Further changes will follow to prepare the test base step by
step for scale-out-scheduler changes.
Change-Id: If8a55873c580b66137476d20d4bfcb5a22544522
Story: 2007192
Currently an executor still executes merge jobs even when it is
paused. This is surprising to the user and an operational problem when
an executor misbehaves for some reason. Furthermore, the merger can now
also be paused explicitly.
Change-Id: I7ebf2df9d6648789e6bb2d797edd5b67a0925cfc
Normally, if a change is made to a job configuration, we want to
ignore the file matchers and run the job so that we can see the
update in practice. However, we don't want to do so if the
*only* change to the job configuration is the file matchers. The
result is no different than before, and running the job tells us
nothing about the value of the change.
This behavior was already the case, but due to a bug. This
change both fixes the bug and intentionally restores the
behavior.
The bug that is corrected is that when variants were applied, the
actual file matchers attached to the frozen job were updated, but
not their string/list representation (stored in the '_files' and
'_irrelevant_files' attributes). This change moves those
attributes to the list of context attributes (which are
automatically copied on inheritance) -- the same as the actual
matcher attributes.
We also do the same for the '_branches' attribute (the text
version of the branch matcher). These changes make the
serialized form of the frozen job (which is returned by the web
api) more accurate. And in the case of the files matchers,
causes Zuul to correctly detect that the job has changed when
they are updated.
To restore the current (accidental but desired) behavior, we now
pop the files and irrelevant-files keys from the serialized
representation of the job before comparison when deciding whether
to run the job during a config update. This means that it will
only run if something other than a files matcher is changed.
This change also removes the _implied_branch job attribute
because it is no longer used anywhere (implied branches now
arrive via projects and project templates).
Change-Id: Id1d595873fe87ca980b2f33832f55542f9e5d496
The fail-fast mechanism should not be triggered on a build that got
retried. This is untested so far so add a test case for it.
Change-Id: I2c5671072a33f0513b566e7a90254663657d39d1
A jitter value for the periodic pipeline in
`test_timer_with_jitter` renders this test
flaky, since we can not know when exactly the
pipeline gets triggered.
This change makes the test case deterministic
by replacing sleeps with `iterate_timeout`.
Change-Id: I2dd5222fbf33fbee234b6e1ae577db44004a2e12
To prevent periodic pipelines from all triggering at the same moment
and thus potentially putting a lot of load on a system, the trigger
times can be spread by a jitter value. The APScheduler library Zuul uses for timer
triggers supports such jitter values.
This change allows one to pass that jitter value within the cron-like
'time' parameter of the timer trigger.
The optional jitter parameter can be passed as the last field, after
the optional 'seconds'.
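A rough sketch of how the extended time specification could be parsed
and handed to APScheduler (the exact parsing in Zuul's timer driver may
differ):

    from apscheduler.triggers.cron import CronTrigger

    def make_trigger(time_spec):
        # Cron-like spec: "minute hour dom month dow [second [jitter]]"
        parts = time_spec.split()
        minute, hour, dom, month, dow = parts[:5]
        second = parts[5] if len(parts) > 5 else None
        jitter = int(parts[6]) if len(parts) > 6 else None
        return CronTrigger(minute=minute, hour=hour, day=dom, month=month,
                           day_of_week=dow, second=second, jitter=jitter)

    # e.g. make_trigger('0 4 * * * 0 60') fires daily at 04:00:00,
    # spread by up to 60 seconds of jitter.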
Change-Id: I58e756dff6251b49a972b26f7a3a49a8ee5aa70e
Currently we only can modify the tenant configuration by triggering a
full reconfiguration. However with many large tenants this can take a
long time to finish. Zuul is stalled during this process. Especially
when the system is at quota this can lead to long job queues that
build up just after the reconfiguration. This adds support for a smart
reconfiguration that only reconfigures tenants that changed their
config. This can speed up the reconfiguration a lot in large
multi-tenant systems.
Change-Id: I6240b2850d8961a63c17d799f9bec96705435f19
Currently the algorithm for canceling is as follows:
1. check url of build, if exists abort and return
2. remove from gearman queue and return if successful
3. sleep a second
4. check url of build again
The url part of the build is used to determine whether it has been
started on the executor. However, the url is actually set just before
the ansible playbooks are started, after all repo fetching and merging
has happened. Most of the time this takes more than the one-second wait
in step three. This leaves a large window open in which builds cannot be
canceled at all and will run to completion.
This can be fixed by reporting the worker_name earlier, just after the
job has been accepted, and using the worker name instead of the url to
indicate whether the job is running.
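A simplified sketch of the revised cancel flow (the helpers are passed
in as stand-ins; this is not the actual scheduler code):

    import time

    def cancel_build(build, remove_from_queue, abort_on_executor, log):
        if build.worker_name:          # job already accepted by an executor
            abort_on_executor(build)
            return
        if remove_from_queue(build):   # still waiting in the gearman queue
            return
        time.sleep(1)                  # give the executor a moment to report
        if build.worker_name:
            abort_on_executor(build)
        else:
            log.warning("Unable to cancel build %s", build)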
Change-Id: I37925c7a6207e8d92d47097c9d66200343b3d090
The patchset or ref, pipeline and project should be enough to trigger an
enqueue. The trigger argument is not validated or used anymore when
enqueueing via RPC.
Change-Id: I9166e6d44291070f01baca9238f04feedcee7f5b
In case a job is removed from a running project pipeline during
reconfiguration, the build result should not be processed.
If the build is processed anyway, it might have unwanted side effects,
e.g. canceling running builds when fail-fast is enabled.
Change-Id: I857f60b1c4adf5cceb8bd3df6e6fe49a22114f9d
When the window size changes, items formerly in the window might no
longer be active, but those items can already have build results. During
reconfiguration the job graph is set to None, and the item is not
prepared if it is no longer active. This becomes problematic when
build.setResult() tries to access the job graph while resetting builds.
This can be solved by also preparing the item if it already had a job
graph. The other option to fix this would be to never change the
item.active state back to False after it became active. However, this
would be a change in the current behavior of Zuul, as the active window
would then only actually decrease as builds finish.
Exceptions:
2019-12-03 08:11:06,298 zuul.Scheduler ERROR Exception while re-enqueing item <QueueItem 0x7f54285a49b0 for <Change 0x7f5442939da0 org/project 4,1> in gate>
Traceback (most recent call last):
  File "/tmp/zuul/zuul/scheduler.py", line 931, in _reenqueueTenant
    item_ahead_valid=item_ahead_valid)
  File "/tmp/zuul/zuul/manager/__init__.py", line 276, in reEnqueueItem
    item.setResult(build)
  File "/tmp/zuul/zuul/model.py", line 2642, in setResult
    to_skip = self.job_graph.getDependentJobsRecursively(
AttributeError: 'NoneType' object has no attribute 'getDependentJobsRecursively'
2019-12-03 08:11:11,401 zuul.Scheduler ERROR Exception in run handler:
Traceback (most recent call last):
  File "/tmp/zuul/zuul/scheduler.py", line 1133, in run
    self.process_result_queue()
  File "/tmp/zuul/zuul/scheduler.py", line 1268, in process_result_queue
    self._doBuildCompletedEvent(event)
  File "/tmp/zuul/zuul/scheduler.py", line 1473, in _doBuildCompletedEvent
    pipeline.manager.onBuildCompleted(event.build)
  File "/tmp/zuul/zuul/manager/__init__.py", line 995, in onBuildCompleted
    item.setResult(build)
  File "/tmp/zuul/zuul/model.py", line 2642, in setResult
    to_skip = self.job_graph.getDependentJobsRecursively(
AttributeError: 'NoneType' object has no attribute 'getDependentJobsRecursively'
Change-Id: I96d4b28a017a1371029ff7b8d516beb09b6b94ad
When the queue for a project is changed during reconfiguration, the
dependent pipeline manager was using the queue of the item
ahead in the old queue.
This led to items ending up in the wrong queues and caused the
following exception in the run handler:
2019-11-28 13:33:47,313 zuul.Scheduler ERROR Exception in run handler:
Traceback (most recent call last):
  File "/tmp/zuul/zuul/scheduler.py", line 1145, in run
    while (pipeline.manager.processQueue() and
  File "/tmp/zuul/zuul/manager/__init__.py", line 914, in processQueue
    item, nnfi)
  File "/tmp/zuul/zuul/manager/__init__.py", line 898, in _processOneItem
    priority = item.getNodePriority()
  File "/tmp/zuul/zuul/model.py", line 2684, in getNodePriority
    return self.pipeline.manager.getNodePriority(self)
  File "/tmp/zuul/zuul/manager/dependent.py", line 101, in getNodePriority
    return items.index(item)
ValueError: <QueueItem 0x7f02583af4e0 for <Change 0x7f02583a9cf8 org/project2 2,1> in gate> is not in list
The issue is fixed by ignoring the given existing queue in the
dependent pipeline manager, since it is always possible to get the
correct queue from the pipeline itself.
Change-Id: Ia5b1b58377e4420b9ab1440c0b9f67cb15967263
Instead of storing a flat list of nodes per hold request, this
change updates the request nodes attribute to become a list of
dictionaries, each with the build uuid and the held node list.
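For illustration, the attribute changes shape roughly like this (key
names follow the description above; the uuid and node ids are
placeholders):

    # Before: a flat list of held node ids per hold request.
    nodes_before = ['0000000001', '0000000002']

    # After: one entry per build, pairing the build uuid with its nodes.
    nodes_after = [
        {'build': 'aa61e1c9e0cd4a6a8f0a78e1e0a3c9b2',
         'nodes': ['0000000001', '0000000002']},
    ]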
Change-Id: I9e50e7ccadc58fb80d5e80d9f5aac70eb7501a36
Change I1fe06d7a889dbccf86116a310701b2755e5fb385 updated the model
to handle the case where zuul_return is used to filter the child
jobs when those jobs have soft dependencies. In short, a filtered
job should no longer be counted as a dependency when its child job
runs (as if the filtered job had not matched the change in the first
place) which makes the behavior consistent with soft dependencies
of jobs which don't match.
However, it also changed the behavior for when a job fails. This
seems to be a simple oversight in applying the new "skip_soft"
pattern in too many places. When a job actually fails, we do not
want its child jobs to run, regardless of whether they have a hard
or soft dependency on the failing parent. In that case, we should
always set them to 'skipped'. Otherwise they will never run and
the item will sit in the queue indefinitely.
Change-Id: Ic19a73d1b6e0fbe5510e71947930de4c46cc1280
When a request is created with a node expiration, set a request
expiration for 24 hours after the nodes expire.
Change-Id: I0fbf59eb00d047e5b066d2f7347b77a48f8fb0e7
Marking the nodes as USED will allow nodepool to delete them.
If we are unsuccessful in marking any of the held nodes as used,
we simply log the error and try again at some future point until
all nodes are eventually marked, allowing the hold request to be
deleted.
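A sketch of the retry behaviour described above ('mark_used' is a
stand-in for the real ZooKeeper update):

    def release_held_nodes(node_ids, mark_used, log):
        # Try to mark every held node as USED so nodepool can delete it.
        # Nodes that fail are retried at some future point; the hold
        # request is only deleted once all of its nodes have been marked.
        remaining = []
        for node_id in node_ids:
            try:
                mark_used(node_id)
            except Exception:
                log.exception("Unable to mark node %s as used", node_id)
                remaining.append(node_id)
        return not remaining   # True once the request can be deleted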
Change-Id: Idd41c58b5cce0aa9b6cd186fa5c33066012790b8
This adds max_hold_expiration and default_hold_expiration as
scheduler options.
max_hold_expiration sets the absolute maximum age, in seconds,
a node placed in the hold state will remain available. This
defaults to 0, which means there is no maximum.
default_hold_expiration sets the default value used if no value
is supplied. This defaults to max_hold_expiration.
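Putting the two options together, the effective hold TTL can be
computed roughly like this (a sketch of the described semantics, not the
actual code):

    def effective_hold_expiration(requested, default_hold_expiration,
                                  max_hold_expiration):
        # If the autohold request did not supply a value, fall back to
        # the default (which itself defaults to max_hold_expiration).
        value = requested or default_hold_expiration
        # A maximum of 0 means "no maximum"; otherwise never exceed it.
        if max_hold_expiration and (not value or value > max_hold_expiration):
            value = max_hold_expiration
        return value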
Change-Id: Ia483ac664e0a2adcec9efb29d3d701f6d315ef3b
New command for the zuul CLI client to retrieve autohold details.
Currently, the only new information included is the 'current_count'
field, but this will later be extended to include held nodes.
Change-Id: Ieae2aea73123b5467d825d4738be07481bb15348
Storing autohold requests in ZooKeeper, rather than in-memory,
allows us to remember requests across restarts, and is a necessity
for future work to scale out the scheduler.
Future changes to build on this will allow us to store held node
information with the change for easy node identification, and to
delete any held nodes for a request using the zuul CLI.
A new 'zuul autohold-delete' command is added since hold requests
are no longer automatically deleted.
This makes the autohold API:
zuul autohold: Create a new hold request
zuul autohold-list: List current hold requests
zuul autohold-delete: Delete a hold request
Change-Id: I6130175d1dc7d6c8ce8667f9b14ae9377737d280
This adds a tenant option to use the Zuul web build page as the
URL reported to the code review system when a build completes.
The setting is per-tenant (because it requires that the tenant
have a working SQL reporter configured in all pipelines) and
defaults to false, since we can't guarantee that. In the future,
we expect to make SQL reporting implicit, then this can default
to true and eventually be deprecated.
A new zuul.conf option is added and marked required to supply
the root web URL. As we perform further integration with the web
app, we may be able to deprecate other similar settings, such
as "status_url".
Change-Id: Iaa3be10525994722d020d2aa5a7dcf141f2404d9
Add an "authorize_user" RPC call allowing to test a set of claims
against the rules of a given tenant. Make zuul-web use this call
to authorize access to tenant-scoped privileged actions.
Change-Id: I50575f25b6db06f56b231bb47f8ad675febb9d82
We currently have five gearman workers in the system which are all
similar but different. In preparation of adding a sixth worker
refactor them to all re-use a central class and the same config and
dispatch mechanism.
Change-Id: Ifbb4c5aec28fe5b044569d365a4e3fe31150eb3b
If Zuul was unable to freeze the job graph of the previous layout,
then it would throw an exception when calculating whether jobs have
changed. This is a problem if it does so on a change which fixes
an error in the current tenant config, as it means the fix can not
merge.
To handle this case, we need to catch that exception, and then provide
a default value indicating whether the job config has been updated
or not. Because returning True could cause every job with a file
matcher to run unnecessarily, the safer default is False (so the job
relies on the file matcher alone).
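Schematically, the comparison now degrades gracefully (a sketch; the
callable passed in stands for the real freeze-and-serialize step, which
is an assumption here):

    def job_config_changed(freeze_and_serialize, log):
        try:
            return freeze_and_serialize('old') != freeze_and_serialize('new')
        except Exception:
            # The previous layout may fail to freeze the job graph at all,
            # e.g. when the change under test fixes a config error.
            # Returning True would make every job with a file matcher run
            # unnecessarily, so default to False and let the file matchers
            # alone decide.
            log.debug("Unable to compare frozen job configs")
            return False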
Change-Id: I8a72e57e068b073e37274cdabeacffc9daee2e94
The recent change to diff job configs in order to run modified jobs
even if file matchers don't match compares the job config in the
change under test to the most recent layout ahead in the queue, or
the running tenant layout if none, in order to determine if a job's
config has changed. In an independent pipeline with multiple
dependent changes, we did not generate layouts for non-live items,
therefore the only layout available to diff was the running tenant
layout. This would cause changes to run too many jobs in that
situation (while they would run the correct jobs in a dependent
pipeline since layouts of items ahead were available).
To correct this, always generate layouts for items with config updates
regardless of whether they are live, and ensure that every item in
the queue has a pointer to a layout (whether that is its own layout
or the layout of the item ahead, or the tenant layout) so that the
next item can diff against it.
This means we are potentially running more merge operations than before.
The TestNonLiveMerges test is a poor match for the new system, since
every change in it contained a configuration update. So a test which
previously was used to verify that we did not perform a merge for every
item now correctly performs a merge for every item. Since that is
no longer a useful test, it is split into two simpler tests; one which
verifies that non-live items without config updates do not perform merges,
and one that repeats the same scenario with config updates and verifies
that we do perform merges for non-live items in that case (as a sort
of control for the other test).
This also changes how errors are reported for a series of changes with
a config error. Previously in the case of a non-live item with a
config error followed by a live item without a config error, we would
report the config error of the non-live item on the live item, because
Zuul was unable to determine which change was responsible for the error
(since only the live item got a merge and layout). Now that both items
have layouts, Zuul can determine that it does not need to report the
non-live item's error on the live item. However, we still report an
error on the live item since it can not merge as-is. We now report
an abbreviated error: "This change depends on a change with an invalid
configuration."
Change-Id: Id533772f35ebbc76910398e0e0fa50a3abfceb52
This causes file matchers to automatically match when the
configuration of the job itself changes. This can be used instead
of matching "^.zuul.yaml$" which may cause too many jobs to run
in larger repos.
Change-Id: Ieddaead91b597282c5674ba99b0c0f387843c722
By adding that attribute to the formatter it is then possible to
report the buildset status URL in the reporter start message.
For instance:
start-message: Build started. Ephemeral buildset status
  {status_url}/{pipeline.tenant.name}/status/change/{change.number},{change.patchset}.
Change-Id: I5f7503e4babc6b84b20292f2063ffd90cb6065d9
A while ago, we removed the assertion that there were no leaked
git repos because we were unable to make it race-free (potentially
due to changes in memory management in py3). We inadvertently
left it in place for one test: test_executor_shutdown, but recent
changes (additions of more tests, etc) have caused the error to
appear there as well. Remove the assertion but leave the debug
log as a clue for future errors. This makes the behavior the same
for all tests.
Also, add a try/except handler around disabling the garbage
collector.
Change-Id: Ib503ea55fd8ddc3fda1002c7edf1a5334a0ad06f
Fix random failures seen in test_scheduler.TestExecutor.
Random failures were seen, sometimes on py35 and sometimes on py36, in
tests on the pagure driver patch: https://review.opendev.org/604404/
With that patch, I no longer see the random failures.
This disabling is already done in tests/base.py assertFinalState.
We currently lack the means to support resource accounting of tenants or
projects. Together with an addition to nodepool that adds resource
metadata to nodes we can emit statsd statistics per tenant and per
project.
The following statistics are emitted:
* zuul.nodepool.resources.tenant.{tenant}.{resource}.current
Gauge with the currently used resources by tenant
* zuul.nodepool.resources.project.{project}.{resource}.current
Gauge with the currently used resources by project
* zuul.nodepool.resources.tenant.{tenant}.{resource}.counter
Counter with the summed usage by tenant. e.g. cpu seconds
* zuul.nodepool.resources.project.{project}.{resource}.counter
Counter with the summed usage by project. e.g. cpu seconds
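For illustration only, emitting these with the generic statsd Python
client could look roughly like this (Zuul's own statsd integration may
differ):

    import statsd

    client = statsd.StatsClient('localhost', 8125,
                                prefix='zuul.nodepool.resources')

    def report_usage(tenant, project, resource, current, used_delta):
        # Gauges for the currently used resources.
        client.gauge('tenant.%s.%s.current' % (tenant, resource), current)
        client.gauge('project.%s.%s.current' % (project, resource), current)
        # Counters for the summed usage, e.g. cpu seconds.
        client.incr('tenant.%s.%s.counter' % (tenant, resource), used_delta)
        client.incr('project.%s.%s.counter' % (project, resource), used_delta)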
Depends-On: https://review.openstack.org/616262
Change-Id: I68ea68128287bf52d107959e1c343dfce98f1fc8
If we have an event, we should also submit its id to the merger so
we're able to trace merge operations via an event id.
Change-Id: I12b3ab0dcb3ec1d146803006e0ef644e485a7afe
This patch fixes a misbehavior of the dequeue mechanics due to it not
taking the project name into account.
When multiple ref changes with the same ref (e.g. refs/heads/master) but
different projects are in the same pipeline, the dequeue command will
always dequeue the change at the top even if the project name
does not match.
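The corrected lookup has to match both the project and the ref, roughly
along these lines (illustrative only):

    def find_item_to_dequeue(items, project_name, ref):
        # Previously only the ref was compared, so the top-most change
        # with a matching ref was dequeued even if it belonged to
        # another project.
        for item in items:
            if (item.change.project.name == project_name
                    and item.change.ref == ref):
                return item
        return None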
Change-Id: Iea0ec75ca4a805feee61de2036ffd4efb511a50a
In some cases, such as resource-constrained environments, it is
beneficial to report on changes in a fail-fast manner, i.e. to report
immediately if one job fails. This can be especially useful if a project
has many expensive long-running jobs. This introduces a fail-fast flag
in the project pipeline that lets the project choose the trade-off
between full information and quick feedback.
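Conceptually, with fail-fast enabled the pipeline manager cancels the
remaining builds as soon as one job reports a failure, along these lines
(a sketch; the flag's location on the build set is an assumption):

    def on_build_completed(item, build, cancel_jobs):
        # 'cancel_jobs' stands in for the manager's cancel machinery.
        if build.result == 'FAILURE' and item.current_build_set.fail_fast:
            # Report right away instead of waiting for the remaining
            # long-running jobs to finish.
            cancel_jobs(item)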
Change-Id: Ie4a5ac8e025362dbaacd3ae82f2e8369f7447a62