Zuul
Zuul is a pipeline-oriented project gating system. It facilitates running tests and automated tasks in response to Code Review events.
At a Glance
- Hosts:
  - https://zuul.opendev.org
  - zuul*.opendev.org
  - ze*.opendev.org
  - zm*.opendev.org
- Configuration:
  - zuul/main.yaml
  - zuul.d
- Chat: #zuul:opendev.org on Matrix
Overview
The OpenDev project uses a number of pipelines in Zuul:
- check: Newly uploaded patchsets enter this pipeline to receive an initial +/-1 Verified vote.
- gate: Changes that have been approved by core reviewers are enqueued in order in this pipeline, and if they pass tests, will be merged.
- post: This pipeline runs jobs that operate after each change is merged.
- pre-release: This pipeline runs jobs on projects in response to pre-release tags.
- release: When a commit is tagged as a release, this pipeline runs jobs that publish archives and documentation.
- silent: This pipeline is used for silently testing new jobs.
- experimental: This pipeline is used for on-demand testing of new jobs.
- periodic: This pipeline has jobs triggered on a timer, e.g. for daily testing against environmental changes.
- promote: This pipeline runs jobs that operate after each change is merged in order to promote artifacts generated in the gate pipeline.
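For reference, a pipeline definition in zuul/main.yaml looks roughly like the following trimmed sketch; the real OpenDev definitions carry many more triggers, requirements, and reporters (see the Zuul reference manual for the full syntax):

  - pipeline:
      name: check
      manager: independent
      trigger:
        gerrit:
          - event: patchset-created
      success:
        gerrit:
          Verified: 1
      failure:
        gerrit:
          Verified: -1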
Zuul watches events in Gerrit (using the Gerrit "stream-events" command) and matches those events to the pipelines above. If a match is found, it adds the change to the pipeline and starts running related jobs.
The gate pipeline uses speculative execution to improve throughput. Changes are tested in parallel under the assumption that changes ahead in the queue will merge. If they do not, Zuul will abort and restart tests without the affected changes. This means that many changes may be tested in parallel while continuing to assure that each commit is correctly tested.
Zuul's current status may be viewed at https://zuul.opendev.org/.
Zuul's configuration is stored in zuul/main.yaml. Anyone may propose a change to the configuration by editing that file and submitting the change to Gerrit for review.
For the full syntax of Zuul's configuration file format, see the Zuul reference manual.
Sysadmin
Zuul has three main subsystems:
- Zuul Scheduler
- Zuul Executors
- Zuul Web
In OpenDev's deployment, these depend on four 'external' systems:
- Nodepool
- Zookeeper
- gear
- MySQL
Scheduler
The Zuul Scheduler and gear are co-located on a single host, referred to by the zuul.opendev.org CNAME in DNS.
Zuul is stateless, so the server does not need backing up. However, Zuul talks to Gerrit over git and ssh, so you will need to manually verify and accept the ssh host keys as the zuul user, e.g.:
sudo su - zuul
ssh -p 29418 review.opendev.org
The Zuul Scheduler talks to Nodepool using Zookeeper and distributes work to the executors using gear.
OpenDev's Zuul installation is also configured to write job results into a MySQL database via the SQL Reporter plugin. The database for that is a Rackspace Cloud DB and is configured in the mysql entry of the zuul_connection_secrets entry for the zuul-scheduler group.
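Those secrets are ultimately rendered into a connection section of zuul.conf on the scheduler. A minimal sketch with hypothetical values (on newer Zuul versions this may live in a [database] section instead):

  [connection mysql]
  driver=sql
  dburi=mysql+pymysql://zuul:PASSWORD@db.example.com/zuul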
Executors
The Zuul Executors are a horizontally scalable set of servers named ze*.opendev.org. They perform git merging operations for the scheduler and execute Ansible playbooks to actually run jobs.
Our jobs are configured to upload as much information as possible along with their logs, but if there is an error which cannot be diagnosed in that manner, logs are available in the executor-debug log file on the executor host. You may use the Zuul build UUID to track assignment of a given job from the Zuul scheduler to the Zuul executor used by that job.
It is safe, although not free, to restart executors. If an executor goes away the scheduler will reschedule the jobs it was originally running.
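If a single executor needs to be taken out of service, a graceful stop will wait for its running builds to complete first. A sketch assuming OpenDev's containerized deployment (the compose file path and service name here are assumptions):

root@ze01# docker-compose -f /etc/zuul-executor/docker-compose.yaml exec executor zuul-executor graceful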
Web
Zuul Web is a horizontally scalable service. It is currently running colocated with the scheduler on zuul.opendev.org. Zuul Web provides live console streaming and is the home of various web dashboards such as the status page.
Zuul Web is stateless, so it is safe to restart; however, anyone watching a live stream of a console log when the restart happens will lose their connection.
Restarting Zuul Services
Currently the safest way to restart the Zuul scheduler is to restart all services at the same time. If the scheduler is restarted but the executors are not, the executors and scheduler can get out of sync with each other. Restarting Zuul Web or a single executor on its own remains safe as noted above, but when the scheduler must be restarted, this full-restart process should generally be preferred.
Zuul Scheduler restarts are disruptive, so non-emergency restarts should always be scheduled for quieter times of the day, week and cycle. We should attempt to be courteous and avoid restarts when project teams are cutting releases or have other important changes that are about to land.
Since Zuul is stateless, some work needs to be done to save and then re-enqueue patches when restarts are done. To accomplish this, start by running the zuul-changes script to save the check and gate queues:
root@zuul02# ~root/zuul-changes.py https://zuul.opendev.org >queues-$(date +%Y%m%d).sh
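The saved script contains one enqueue command per change. Depending on the version of zuul-changes.py in use, the lines look roughly like this (illustrative values, not real output):

zuul-client enqueue --tenant openstack --pipeline check --project opendev/system-config --change 123456,7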
The resulting script will be executed when Zuul is up and running again to restore the previous queue contents.
One other thing to consider before restarting all Zuul services is that you may want to update all of the Zuul docker images. This can be useful if restarting Zuul to pick up a fix that has landed in the Zuul codebase. To do this, run the zuul_pull.yaml playbook from bridge:
root@bridge# ansible-playbook -f20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_pull.yaml
Once ready to restart all Zuul services, run the zuul_restart.yaml playbook from bridge:
root@bridge# ansible-playbook -f20 /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_restart.yaml
Once this playbook is done running, the services will have been restarted, but the Zuul system still needs to load its configs before it is ready to do work. The root of the Zuul dashboard will show you loaded tenants. Once all tenants show up on this page, it is safe to proceed with re-enqueuing changes to pipelines with the script we generated earlier. Note that the OpenStack tenant takes the most time; if you wait for it to show up in the dashboard you should be ready to go. You can double check this by loading the OpenStack Zuul status and ensuring it doesn't report an error.
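One way to check tenant loading from the command line is to query the Zuul REST API, for example:

you@anywhere$ curl -s https://zuul.opendev.org/api/tenants | jq -r '.[].name'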
To re-enqueue, execute the previously generated script:
root@zuul# bash queues-$(date +%Y%m%d).sh
When this has completed you are done with the Zuul restart. Please log the restart and any Zuul version update with statusbot in IRC.
Secrets
In some cases it may be warranted to compare the decrypted plaintext of a secret from job configuration against a reference value while troubleshooting, since random padding means encrypting the same plaintext a second time will result in wholly different ciphertext. In order to avoid unintentional disclosure this should only be done when absolutely necessary, but it's possible to decrypt a secret locally on the scheduler server. The first step is extracting the key data from our daily key backups:
root@zuul# jq --raw-output '.keys."/keystorage/gerrit/opendev/opendev%2Fsystem-config".keys[0].private_key' /var/lib/zuul/zuul-keys-backup.json
The name between the double quotes is the path to the project's keys in ZooKeeper. To construct this you will need to know the Zuul connection name and the full project name. The connection name in the example above is 'gerrit'; replace it with the appropriate connection name for the project you are looking at. Next is the unique project name. In the example above we start with opendev/system-config and split it on /. Everything before the first / is the next component of our name, in this case opendev. Then we take the entire name opendev/system-config and URL-encode it to get opendev%2Fsystem-config, which becomes our last component.
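For example, a hypothetical lookup of openstack/nova on the same gerrit connection would use this path and jq invocation:

root@zuul# jq --raw-output '.keys."/keystorage/gerrit/openstack/openstack%2Fnova".keys[0].private_key' /var/lib/zuul/zuul-keys-backup.json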
Save the output of this jq command to a file secret.pem. Then extract the secret ciphertext from the job configuration, remove the surrounding YAML (there is no need to recombine split lines), and run the following command to decrypt:
cat ciphertext.txt | sed 's/^ *//' | base64 -d | sudo openssl rsautl -decrypt -oaep -inkey secret.pem
Debugging Problems
Occasionally you'll have a job enter an error state or an entire change that appears to be stuck in a Zuul pipeline. Debugging these problems can be a bit daunting to start as Zuul's logs are quite verbose. The good news is that once you learn a few tricks those verbose logs become quite the powerful tool.
Often the best place to start is grepping the Zuul scheduler debug log for the pipeline entry identifier (e.g. change number, tag, or ref sha1):
you@zuul02$ grep 123456 /var/log/zuul/debug.log
you@zuul02$ grep c6229660cda0af42ecd5afbe7fefdb51136a0436 /var/log/zuul/debug.log
In many of these log lines you'll see Zuul event IDs like [e: 1718628fe39643e1bd6a88a9a1477b4f]. This ID identifies the event that triggered Zuul to take action for these changes and is logged through all the Zuul services. It can be very powerful to grep for this event ID and trace through the actions that the scheduler took for this event:
you@zuul02$ grep 1718628fe39643e1bd6a88a9a1477b4f /var/log/zuul/debug.log
This might lead you to look at executor logs, where you can use the same ID to grep for actions related to this event on the executor:
you@ze01$ grep 1718628fe39643e1bd6a88a9a1477b4f /var/log/zuul/executor-debug.log
As you trace through the logs related to a change or event ID, you can look for ERROR or Traceback messages to try to identify the underlying source of the problem. Note that Traceback messages are not prefixed with the event ID, which means you'll have to grep with additional context, for example using grep -B20 -A20.
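For example:

you@zuul02$ grep -B20 -A20 Traceback /var/log/zuul/debug.log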
Another useful debugging tool is Zuul's SIGUSR2 handler. This signal handler produces a thread dump in the debug log and toggles the yappi python profiler. Each Zuul service supports the signal handler and it can be triggered via:
you@zuul02$ sudo kill -USR2 $ZUUL_PID
To determine $ZUUL_PID you can run ps against the zuul-* service that you are interested in getting information from. For example:
you@zuul02$ ps -ef | grep zuul-scheduler
zuuld 1893030 1893010 0 08:33 ? 00:00:00 /usr/bin/dumb-init -- /usr/local/bin/zuul-scheduler -f
zuuld 1893052 1893030 69 08:33 ? 07:57:42 /usr/local/bin/python /usr/local/bin/zuul-scheduler -f
zuuld 1893198 1893052 0 08:33 ? 00:03:22 /usr/local/bin/python /usr/local/bin/zuul-scheduler -f
All of the Zuul services are run under dumb-init. The process to send SIGUSR2 to is the child of the dumb-init process. In the example above, $ZUUL_PID would be 1893052.
The first time you send the signal you will turn on the yappi profiler. This profiler incurs a runtime cost which can significantly slow down Zuul's processing of pipelines. Be sure to resend the signal once you have let Zuul run long enough to collect a representative set of profiler data; in most cases a minute or two should be sufficient. Slow memory leaks may require hours, but running Zuul under yappi for hours isn't practical.
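Putting that together, a typical profiling session looks like this (using the PID from the ps example above):

you@zuul02$ sudo kill -USR2 1893052
you@zuul02$ sleep 120
you@zuul02$ sudo kill -USR2 1893052

The first signal writes a thread dump and enables yappi; the second writes another thread dump and logs the collected profile data.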
GitHub Projects
OpenStack does not use GitHub for development purposes, but some non-OpenStack projects in the broader ecosystem that we care about do. When we are interested in setting up jobs in Zuul to test the interaction between OpenStack projects and those ecosystem projects, we can add the OpenDev Zuul GitHub app to those projects, then configure them in Zuul.
In order to add the GitHub app to a project, an admin on that project should navigate to the OpenDev Zuul app in the GitHub UI. From there they can click "Install", then choose the project or organization they want to install the App on.
The repository then needs to be added to the zuul/main.yaml file before Zuul can be configured to actually run jobs on it.
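A minimal sketch of such an addition, assuming an illustrative tenant and repository name and that the GitHub connection is named github (the real file nests these entries under the existing tenant definitions):

  - tenant:
      name: example-tenant
      source:
        github:
          untrusted-projects:
            - example-org/example-repo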
Information about the configuration of the OpenDev Zuul App itself can be found on the GitHub page at openstack_zuul_app.