If the final post playbook fails, something has gone wrong with log
uploading, which means it's very hard to debug. Grab the contents of the
JSON log file, extract the log for the last playbook, and add it to the
executor log.
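A rough sketch of the idea (the structure and field names of the JSON log
are assumptions here, not a guarantee of what Zuul writes):

  import json

  def log_last_playbook(json_log_path, log):
      with open(json_log_path) as f:
          playbooks = json.load(f)       # assumed: a list of playbook dicts
      if not playbooks:
          return
      for play in playbooks[-1].get('plays', []):
          for task in play.get('tasks', []):
              for host, result in task.get('hosts', {}).items():
                  log.debug("%s: %s", host, result.get('stdout', ''))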
Change-Id: Ia930311e121c350e73e41b20e9b742b2eac9c9f6
During dynamic reconfiguration, we treated deleted files the same
as files which were not present in the collected file set for the
changes in question. This meant that if someone deleted a file
from a project-branch, we would end up using the cached data rather
than no data.
Only use cached data if there is no entry for the project-branch
in the files object. If there is an entry, even if it's empty,
do not use the cached data.
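A minimal sketch of the rule (the data structures are illustrative, not
Zuul's actual attributes):

  def get_branch_config(files, project, branch, cache):
      key = (project, branch)
      if key not in files:
          # Nothing was collected for this project-branch; use the cache.
          return cache.get(key)
      # An entry exists -- even an empty one means the file was deleted --
      # so never fall back to the cached data.
      return files[key]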
Change-Id: If18bfa12b6c8e9ac5733bfe597f48b90ec49df9e
A recent change began performing full validation on project-pipeline
job variants rather than deferring to the jobparser. However,
we were modifying the config object inside the project-template parser
to add the job name, before passing it to the job parser. This meant
that the template passed validation the first time, but during a
subsequent reconfiguration, the cached value would already have the
name attribute and fail validation.
This change stops adding the name attribute, instead passing it to
the jobparser. It also disables validation by the jobparser in the
case that the project-template parser has already performed it, so that
we don't slow reconfiguration.
Change-Id: I080b4dd01ac40363a932c0853f0f9670b1e2c511
Jobs aborted by the executor are not counted toward the retry limit.
Extend the test_job_aborted test case to check that.
Change-Id: I47fa60fe8ff9da62cb11e669b11e60233d464794
In project and project-template definitions, the existing voluptuous
schema for the jobs in the job list was vs.Any(str, dict). The contents
of the dict itself need to be validated as well. For example, a job entry
that looks like:
  check:
    jobs:
      - project-test1:
          - required-projects:
              org/project2
is invalid, as the contents of the project-test1 job dict should themselves
be a string or a dict rather than a list. This updates the error to be:
  Zuul encountered a syntax error while parsing its configuration in the
  repo org/project on branch master. The error was:

    expected str for dictionary value @ data['check']['jobs'][0]['project-test1']

  The error appears in the following project stanza:

    project:
      name: org/project1
      check:
        jobs:
          - project-test1:
              - required-projects:
                  org/project2

  in "org/project/.zuul.yaml@master", line 4, column 3
The error, 'expected str for dictionary value', could probably be improved
at some point, but at least it is an error with a message, which is far
better than 'Unknown configuration error'.
In the JobParser voluptuous schema, split out the job attributes that can
be used in job lists from the ones that can't. For now, 'name' is the only
attribute that can't be used.
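Roughly, the split looks like this (the attribute list is trimmed and the
names are illustrative, not the complete Zuul schema):

  import voluptuous as vs

  job_attributes = {
      'parent': str,
      'required-projects': vs.Any(str, [str]),
      # ... the remaining job attributes ...
  }
  # Top-level job definitions may use everything, including 'name'.
  job = vs.Schema({**job_attributes, 'name': str})
  # Entries in a project's job list may not set 'name'.
  job_list_entry = vs.Any(str, {str: vs.Any(None, job_attributes)})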
Also fix a test fixture that had a trailing : in it.
Change-Id: I217eb5d6befbed51b220d47afa18997a87982389
In case we lose the connection before fully updating all the nodes
associated with a node request, set the request attributes last.
Change-Id: Ib5099005ffb2990940672ce34623bc35b8903739
On a reconfiguration, when we re-enqueue an item, we could end
up creating a job graph for an item before the merge for that
item has returned. Make sure we use the same methods that the
main processing loop uses so that we maintain the state machine
around waiting for merges and dynamic layouts.
Change-Id: I2b99620cfe1f7666ee3e6297ce041b3f7a02e051
When a change alters the layout, we load it twice -- once with
all the config repos so we can check syntax, and a second time
without them (because we don't want to run with unmerged config
repo changes) to generate the layout we actually use.
In the case where there are no untrusted config changes involved,
we will end up producing a layout that's identical to what we would
have used if we didn't generate a new one. And in the case where
there are no trusted config changes involved, the layout we generate
for syntax checking won't be any different than the one we generate
to use.
So in these cases, only generate the layout once. If a stack involves
both trusted and untrusted changes, we will still generate it twice.
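A condensed sketch of the decision (names are illustrative, not the actual
scheduler code):

  def create_layouts(trusted_updates, untrusted_updates, build_layout):
      # The two layouts can only differ when the series touches both
      # trusted (config-project) and untrusted configuration.
      if trusted_updates and untrusted_updates:
          check_layout = build_layout(include_config_changes=True)
          layout = build_layout(include_config_changes=False)
      else:
          layout = check_layout = build_layout(
              include_config_changes=bool(trusted_updates))
      return check_layout, layout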
Change-Id: Ic97b54af6e4e598225dc65aa3140fb8f1dcfb28e
On full and tenant reconfiguration, we construct new pipeline objects
and re-enqueue all of the items into them. Because of this, we need
to be careful with references to pipelines. Generally we use the
forward references from the Layout, but we also regularly make use
of backward references from QueueItems. When we re-enqueue items
we take care to update these backward references (or clear them
if we are unable to re-enqueue them).
Unfortunately, we missed another backward reference, those from
Build objects. This can cause old pipelines (and, because they
have backward references to their layouts, entire old layouts) to
persist as long as there is a build in the system referencing them.
And because we keep all of the builds for an item in the previous
buildset records, that can be as long as an item is in the system.
Which can be a very long time on bad days.
To correct this, remove the pipeline backref from Build and replace it
with a property method which finds the pipeline by way of the
build's buildset and then its item, which should be safely updated
on re-enqueing, as described above.
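A simplified sketch of the replacement property (trimmed from the real
model classes):

  class Build(object):
      def __init__(self, job, uuid):
          self.job = job
          self.uuid = uuid
          self.build_set = None   # set when the build joins a BuildSet

      @property
      def pipeline(self):
          # Resolve through the buildset and its item, both of which are
          # kept up to date when items are re-enqueued.
          return self.build_set.item.pipeline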
Add a test which verifies no extra pipeline objects persist.
Change-Id: I837c0eb8f49ea238a1d5ca2435acb8b245f4a871
If an executor disconnects cleanly or uncleanly it's highly unlikely
to be the fault of the job in question. Always retry the job without
incrementing the retry count.
Change-Id: I3cd18ca2a021f70665609e51247b2110cffcb71c
When we stop zuul-executors, we actually abort running jobs. As a
result of this, we'd like said jobs to be scheduled again onto another
executor.
Change-Id: Ia03e8e69097642656768e70f9d537465180c99b9
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
We should communicate this to the user as we do with TIMED_OUT,
and also not retry it since it is unlikely to be any different
if we run again.
Change-Id: I9371e03c073e3883cdf14bc1a848b7b7d72c9047
If a playbook refers to a role that does not exist, Ansible will output
"ERROR!" followed by an error. Such as:
  ERROR! the role 'add-launchpadlib-credentials' was not found in ...

  The error appears to have been in '...': line 5, column 7, but may be
  elsewhere in the file depending on the exact syntax problem.

  The offending line appears to be:

    - add-sshkey
    - add-launchpadlib-credentials
      ^ here
Simply adding the error to the build log will help people track down the
issue.
Change-Id: If49c50c16844243cbbade4b7fef7a43df4107d43
In I7e34206d7968bf128e140468b9a222ecbce3a8f1 we modified how
messages are printed for the playbook banner (and now, footer).
The message wasn't formatted properly because the result of the format
call was never assigned.
Additionally, we need to add a line break after these messages,
otherwise the next message starts on the same line.
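An illustration of the bug (the variable names are invented for the
example):

  phase, playbook = 'run', 'playbooks/base.yaml'
  msg = "PLAYBOOK [{phase} : {playbook}]"
  msg.format(phase=phase, playbook=playbook)        # bug: result discarded
  msg = msg.format(phase=phase, playbook=playbook)  # fix: keep the result
  msg += "\n"                                       # and end with a line break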
Change-Id: I808d8908815ffa5fae409e600308dfb9ff9c6e77
If our queue processing is slow, and we lose the ZooKeeper session
after a node request has been fulfilled, but before we actually accept
the nodes, we need to be aware of this and not try to use the nodes
given to us.
Also pass the request ID along in the event queue since the actual
request object can have its ID changed out from underneath us on a
resubmit. Compare the queued ID with the request object's current ID, and
also verify that the actual request still exists.
Furthermore, when we lock nodes, let's make sure that they are actually
allocated to the request we are processing.
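A rough sketch of the checks (method and attribute names are assumptions,
not the actual Zuul/ZooKeeper API):

  def accept_nodes(request, queued_request_id, zk):
      if request.id != queued_request_id or not zk.request_exists(request):
          # Resubmitted (new ID) or vanished: don't touch these nodes.
          return False
      for node in request.nodes:
          zk.lock_node(node)
          if node.allocated_to != request.id:
              zk.unlock_node(node)
              return False
      return True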
Change-Id: Id89f6542afcf3f5d4a0b392b5cb8cf21ec3f6865
The current code checks to see that the destination path shares a prefix
with os.path.curdir. However, os.path.curdir is set to the directory
containing the playbook, not the root of the workdir, which means we're
not excluding things in the trusted dir like we'd like to be doing.
We already set HOME to the root of the workdir, so we can just switch
the check from os.path.curdir to os.path.expanduser('~') and achieve the
original intent.
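A small sketch of the corrected check (the surrounding executor code is
omitted):

  import os

  def check_dest(path):
      # HOME is already the root of the work dir, so anchor the check there.
      workdir = os.path.abspath(os.path.expanduser('~'))
      if not os.path.abspath(path).startswith(workdir + os.sep):
          raise Exception("%s is not within the working dir" % path)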
Change-Id: Ifac41f74f3306fe74b522c910867f9a5375bd61e
Expose two new settings in zuul.conf, giving an operator better control
over them. By default, we set the speed limit to 1000 bytes and the speed
time to 30 seconds.
Change-Id: I9da80fcfc312cbc12ea11ee7284eaec23adb97c9
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Using the GIT_HTTP_LOW_SPEED_LIMIT and GIT_HTTP_LOW_SPEED_TIME
environment variables, we can support timeouts for git over HTTP(S).
Currently, it is possible for a merger to block forever if something
happens to the network.
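A minimal illustration of the mechanism (not the merger's actual code):

  import os
  import subprocess

  # Abort a git HTTP(S) transfer that stays below 1000 bytes/s for 30 seconds.
  env = dict(os.environ,
             GIT_HTTP_LOW_SPEED_LIMIT='1000',   # bytes per second
             GIT_HTTP_LOW_SPEED_TIME='30')      # seconds
  subprocess.run(['git', 'fetch', 'origin'], env=env, check=True)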
Change-Id: I8ab245b221eae3a9faacd0b9ba6f7b14b27e6b0e
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
We're removing the infra variant. We've also removed the AFS prep, so we
don't need the success-url to be different.
Change-Id: I470e9d94044112fc5a5deb3e8dc21c941871f76d
Together, these changes build an OpenStack-sized configuration in
8% of the time it currently takes.
Change-Id: I85f538a7ebdb82724559203e2c5d5380c07f07e7
This may cause the UI to display only a subset of the total changes,
leading to some confusion. Consistently use the local variable that we
define, which is set even when change.id is null.
Change-Id: I7c5ff2d9c6ba83e8a8265df3fd83afabe1984fe2
We are moving _emit_playbook_banner from zuul_stream to the executor
because the data that is meant to be printed inside the banner is not
available until v2_playbook_on_play_start, which is triggered on every
play. Moving it to the executor ensures that the banner is only printed
once per playbook, even if the playbook has multiple plays.
Additionally, we're also printing a 'footer' banner at the end of
each playbook in order to provide an explicit and easily parsable
result from the perspective of Ansible.
This will make it easier for people and systems to parse the job
output to find the result of particular playbooks.
For example, in Zuul v2, Elastic-Recheck would rely on the last lines
of the console to reliably tell when the job had completed and what its
status was.
This new v3 approach allows increased granularity by providing the
ability to filter down to specific playbooks or even on a phase basis.
Change-Id: I7e34206d7968bf128e140468b9a222ecbce3a8f1
As part of Zuul v2 partial rollback, use the
temporary pipelines.
Change-Id: Id6387e0d702ee3d167d6d43820d61242edb5cbae
Depends-On: I084b819ea0a4b67f36df61ccb3bfd6963cc3940d
* The "Adding .. event" log messages are redundant -- whatever is adding the
event logs it just as well, and they make filtering for Scheduler messages
difficult. Remove them.
* The gerrit/github connections log both the driver and connection name.
The driver can be inferred from the connection, and very frequently
this just ends up saying "Scheduling gerrit event from gerrit" which
is weird. Remove the driver name.
* Make the pipeline logs say which pipeline they are for. This is way more
useful than the manager name (which is implied by the pipeline name anyway).
Change-Id: I8fcaef87ddcd8428776ee76a4519e4764f2d9c5b
When a config change lands, we clear the cached configuration for
that project before reconfiguring. However, the reconfiguration
is not synchronous. On a busy system, it can be a long time between
the change landing and the reconfiguration. By clearing the cache
synchronously and performing the reconfiguration asynchronously,
any dynamic configurations created in the intervening time will
be missing data.
Instead, keep track of the project which needs to be cleared as
an attribute of the async event, and then clear the config only
immediately before reconfiguration. This happens within the
scheduler main loop, so no other configuration actions can happen
between these two steps now.
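A simplified sketch of the new sequencing (class and method names are
illustrative, not the exact scheduler internals):

  class TenantReconfigureEvent(object):
      def __init__(self, tenant, project):
          self.tenant = tenant
          self.project = project   # cache cleared later, not at enqueue time

  def process_event(scheduler, event):
      # Runs in the scheduler main loop, so no other configuration action
      # can happen between clearing the cache and reconfiguring.
      scheduler.clearConfigCache(event.project)
      scheduler.reconfigureTenant(event.tenant)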
Note, there are several changes to tests included in this change.
I used them to create a test which illustrated the bug, however,
I was only able to do so by essentially re-creating the scheduler-
internal sequence in the test itself -- essentially it represented
only the prior erroneous behavior. So while it was useful to
confirm the source of the problem, it is not useful to confirm the
fix, and cannot be included in the test suite since it always
fails. The amount of control needed to mimic this sequence in a
test is significantly beyond our facilities at the moment.
However, some of the additional facilities I created may be useful,
so I'm adding them along with this change.
Change-Id: Id7fdd21f1646ee53986be33bdb5f1437558833ba