In what looks like a typo, we are overriding the bridge node for this
test to a Bionic host. Remove this. This was detected when testing an
upgraded Ansible, which wouldn't install on the older Python on
Bionic.
Change-Id: Ie3e754598c6da1812e74afa914f50d91972012cd
These images have a number of issues we've identified and worked
around. The current iteration of this change is essentially identical
to upstream, with a minor tweak to allow the latest mailman version
and adjusted paths for the hyperkitty and postorius URLs to match
those in the upstream mailman-web codebase; it doesn't try to address
the other items. However, we should consider
moving our fixes from ansible into the docker images where possible
and upstream those updates.
Unfortunately upstream hasn't been super responsive so far, hence this
fork. For tracking purposes here are the issues/PRs we've already filed
upstream:
https://github.com/maxking/docker-mailman/pull/552
https://github.com/maxking/docker-mailman/issues/548
https://github.com/maxking/docker-mailman/issues/549
https://github.com/maxking/docker-mailman/issues/550
Change-Id: I3314037d46c2ef2086a06dea0321d9f8cdd35c73
Grab the make logs from the dkms directory. This is helpful if the
modules are failing to build.
The /var/lib/dkms directory contains all the source and object files,
etc., which seems unnecessary to store in general. Thus we just trim
this to the log directory.
Change-Id: I9b5abc9cf4cd59305470a04dda487dfdfd1b395a
This should now be a largely functional deployment of mailman 3. There
are still some bits that need testing but we'll use followup changes to
force failure and hold nodes.
This deployment of mailman3 uses upstream docker container images. We
currently hack up uids and gids to accommodate that. We also hack up the
settings file and bind mount it over the upstream file in order to use
host networking. We override the hyperkitty index type to xapian. All
list domains are hosted in a single installation and we use native
vhosting to handle that.
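As a rough sketch of the sort of override described above (the service
name, host path, and container path here are assumptions, not the
exact production values):

  services:
    mailman-web:
      network_mode: host   # host networking as described above
      volumes:
        # bind our customised settings (including the xapian index
        # override) over the image's copy
        - /etc/mailman-compose/web-settings.py:/opt/mailman-web/settings.py:ro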
We'll deploy this to a new server and migrate one mailing list domain at
a time. This will allow us to start with lists.opendev.org and test
things like dmarc settings before expanding to the remaining lists.
A migration script is also included, which has seen extensive
testing on held nodes for importing copies of the production data
sets.
Change-Id: Ic9bf5cfaf0b87c100a6ce003a6645010a7b50358
These were forgotten in I137ab824b9a09ccb067b8d5f0bb2896192291883
when we switched the testing bridge host to bridge99.
Change-Id: I742965c61ed00be05f1daea2d6110413cff99e2a
Gerrit made new releases and we should update to them. Release notes can
be found here:
https://www.gerritcodereview.com/3.5.html#354
https://www.gerritcodereview.com/3.6.html#363
The main improvement for us is likely to be the copy approvals
performance boosts and error handling. We still need to run that prior
to our 3.6 upgrade.
Note we currently only run 3.5 in production but we test the 3.6 upgrade
from our current production version so it makes sense to update the 3.6
image as well.
Change-Id: Idf9a16b443907a2d0c19c1b6ec016f5d16583ad2
In thinking harder about the bootstrap process, it struck me that the
"bastion" group we have is two separate ideas that become a bit
confusing because they share a name.
We have the testing and production paths that need to find a single
bridge node so they can run their nested Ansible. We've recently
merged changes to the setup playbooks to not hard-code the bridge node
and they now use groups["bastion"][0] to find the bastion host -- but
this group is actually orthogonal to the group of the same name
defined in inventory/service/groups.yaml.
The testing and production paths are running on the executor, and, as
mentioned, need to know the bridge node to log into. For the testing
path this is happening via the group created in the job definition
from zuul.d/system-config-run.yaml. For the production jobs, this
group is populated via the add-bastion-host role which dynamically
adds the bridge host and group.
Only the *nested* Ansible running on the bastion host reads
s-c:inventory/service/groups.yaml. None of the nested-ansible
playbooks need to target only the currently active bastion host. For
example, we can define as many bridge nodes as we like in the
inventory and run service-bridge.yaml against them. It won't matter
because the production jobs know the host that is the currently active
bridge as described above.
So, instead of using the same group name in two contexts, rename the
testing/production group "prod_bastion". groups["prod_bastion"][0]
will be the host that the testing/production jobs use as the bastion
host -- references are updated in this change (i.e. the two places
this group is defined -- the group name in the system-config-run jobs,
and add-bastion-host for production).
We then can return the "bastion" group match to bridge*.opendev.org in
inventory/service/groups.yaml.
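As a rough sketch of the two definitions (node names, labels, and
exact file layout here are illustrative rather than copied verbatim
from the repo):

  # zuul.d/system-config-run.yaml -- the testing path
  - job:
      name: system-config-run-base
      nodeset:
        nodes:
          - name: bridge99.opendev.org
            label: ubuntu-jammy
        groups:
          - name: prod_bastion
            nodes:
              - bridge99.opendev.org

  # inventory/service/groups.yaml -- read only by the nested Ansible
  groups:
    bastion:
      - bridge*.opendev.org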
This fixes a bootstrapping problem -- if you launch, say,
bridge03.opendev.org the launch node script will now apply the
base.yaml playbook against it, and correctly apply all variables from
the "bastion" group which now matches this new host. This is what we
want to ensure, e.g. the zuul user and keys are correctly populated.
The other thing we can do here is change the testing path
"prod_bastion" hostname to "bridge99.opendev.org". By doing this we
ensure we're not hard-coding for the production bridge host in any way
(since if both testing and production are called bridge01.opendev.org
we can hide problems). This is a big advantage when we want to rotate
the production bridge host, as we can be certain there are no hidden
dependencies.
Change-Id: I137ab824b9a09ccb067b8d5f0bb2896192291883
Python 3.11 has been released. Once the parent commit of this commit
lands we will have removed our python3.8 images making room for
python3.11 in our image list. Add these new images which will make way
for running and testing our software on this new version of python.
Change-Id: Idcea3d6fa22839390f63cd1722bc4cb46a6ccd53
This switches the bridge name to bridge01.opendev.org.
The testing path is updated along with some final references still in
testinfra.
The production jobs are updated in add-bastion-host, and will have the
correct setup on the new host after the dependent change.
Everything else is abstracted behind the "bastion" group; the entry is
changed here which will make all the relevant playbooks run on the new
host.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/862551
Change-Id: I21df81e45a57f1a4aa5bc290e9884e6dc9b4ca13
Run a base test against a Bionic bridge to ensure we don't break
things on the current production host as we move to a new Focal-based
environment.
Change-Id: I1f745a06c4428cf31a166b3d53dd6321bfd41ebc
Following-on from Iffb462371939989b03e5d6ac6c5df63aa7708513, instead
of directly referring to a hostname when adding the bastion host to
the inventory for the production playbooks, this finds it from the
first element of the "bastion" group.
As we do this twice for the run and post playbooks, abstract it into a
role.
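A minimal sketch of what such a role might contain (task names and any
extra connection variables are assumptions; the real role may differ):

  # roles/add-bastion-host/tasks/main.yaml (sketch)
  - name: Find the bastion host
    set_fact:
      bastion_host: "{{ groups['bastion'][0] }}"

  - name: Add bastion host to the inventory
    add_host:
      name: "{{ bastion_host }}"
      groups: bastion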
The host value is currently "bridge.openstack.org" -- as is the
existing hard-coding -- thus this is intended to be a no-op change.
It is setting the foundation to make replacing the bastion host a
simpler process in the future.
Change-Id: I286796ebd71173019a627f8fe8d9a25d0bfc575a
This replaces hard-coding of the host "bridge.openstack.org" with
hard-coding of the first (and only) host in the group "bastion".
The idea here is that we can, as much as possible, simply switch one
place to an alternative hostname for the bastion such as
"bridge.opendev.org" when we upgrade. This is just the testing path,
for now; a follow-on will modify the production path (which doesn't
really get speculatively tested).
This needs to be defined in two places:
1) We need to define this in the run jobs for Zuul to use in the
playbooks/zuul/run-*.yaml playbooks, as it sets up and collects
logs from the testing bastion host.
2) The nested Ansible run will then use the inventory in
inventory/service/groups.yaml
Various other places are updated to use this abstracted group as the
bastion host.
Variables are moved into the bastion group (which only has one host --
the actual bastion host) which means we only have to update the group
mapping to the new host.
This is intended to be a no-op change; all the jobs should work the
same, but just using the new abstractions.
Change-Id: Iffb462371939989b03e5d6ac6c5df63aa7708513
Now that all the bridge nodes are Jammy (3.10), we can uncap this
dependency, which will bring in the latest selenium. Unfortunately,
after investigation, the simplification I had hoped this would allow
doesn't work; comments are added along with small updates for the new
API.
Update the users file-match so they run too.
Change-Id: I6a9d02bfc79b90417b1f5b3d9431f4305864869c
In preparation for upgrading this host, run jobs with a Jammy-based
bridge.openstack.org.
Since this has a much later Python, it brings in a later version of
selenium when testing (used for screenshots) which has dropped some of
the APIs we use. Pin it to the old version; we will fix this in a
follow-on just to address one thing at a time
(I6a9d02bfc79b90417b1f5b3d9431f4305864869c).
Change-Id: If53286c284f8d25248abf4a1b2edd6951437dec2
In discussion of other changes, I realised that the bridge bootstrap
job is running via zuul/run-production-playbook.yaml. This means it
uses the Ansible installed on bridge to run against itself -- which
isn't much of a bootstrap.
What should happen is that the bootstrap-bridge.yaml playbook, which
sets up ansible and keys on the bridge node, should run directly from
the executor against the bridge node.
To achieve this we reparent the job to opendev-infra-prod-setup-keys,
which sets up the executor to be able to log into the bridge node. We
then add the host dynamically and run the bootstrap-bridge.yaml
playbook against it.
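Roughly, the production-path playbook ends up shaped something like
this (a sketch under assumptions, not the literal playbook):

  # run from the executor, not via the nested Ansible on bridge
  - hosts: localhost
    roles:
      - add-bastion-host                 # dynamically add the bridge node

  - import_playbook: bootstrap-bridge.yaml   # then run directly against it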
This is similar to the gate testing path; where bootstrap-bridge.yaml
is run from the executor against the ephemeral bridge testing node
before the nested-Ansible is used.
The root key deployment is updated to use the nested Ansible directly,
so that it can read the variable from the on-host secrets.
Change-Id: Iebaeed5028050d890ab541818f405978afd60124
This was missed in the effort to push out Gerrit 3.5.3 as well as the
ssh rsa sha2 fixes. That said it should be mostly fine as all of the
plugins tagged 3.5.2 have tagged the same commit with 3.5.3, making
this largely a bookkeeping change.
There is one bit that isn't strictly bookkeeping and that is the
plugins/its-base checkout. Against gerrit 3.5 we convert from a master
checkout [0] to a stable-3.5 [1] checkout as this branch exists now.
Against gerrit 3.6 we convert from a stable-3.6 checkout to a master
checkout. I suspect that a stable-3.6 branch existed for a short period
of time and was cleaned up and zuul is using an old cached state.
The change for its-base on gerrit 3.5 does represent a reversion of
three commits but they all seem related to gerrit 3.6 so I expect this
is fine.
[0] https://gerrit.googlesource.com/plugins/its-base/+log/refs/heads/master
[1] https://gerrit.googlesource.com/plugins/its-base/+log/refs/heads/stable-3.5
Change-Id: I619b28fe642ca8b57eb533157ec0a441f6b66890
This adds our first Jammy production server to the mix. We update the
gitea load balancer as it is a fairly simple service which will allow us
to focus on Jammy updates and not various server updates.
We update testing to shift testing to a jammy node as well. We don't
remove gitea-lb01 yet as this will happen after we switch DNS over to
the new server and are happy with it.
Change-Id: I8fb992e23abf9e97756a3cfef996be4c85da9e6f
For some reason this is failing in the gate -- exactly what that
reason is remains hard to determine at the moment. Log the exception.
Change-Id: I13c60c5dfc4ab19d8dec589c96338adc7461c992
I'm not sure why I used this tag; I probably copied it from [1] at the
time? Let's just try latest.
Update matchers so the screenshot jobs run
[1] https://github.com/SeleniumHQ/docker-selenium
Change-Id: I8ea7981dac54883822f3b6076b6f0f564571f018
We want to ensure that the logging apache does for us is sufficient to
trace requests from the load balancer to apache to gitea. To do that we
need to gather the logs and look at them.
Change-Id: I468d37709c1a3c2255b1bfcf38a23bb1a2a75899
Zuul is removing support for old ansible versions. Remove our pin to
old ansible. There shouldn't be any reason for these pins at this point.
Change-Id: I0e0998e0d29d55695c6cd92b10feeb910b086d0a
It is a good idea to periodically update our base python images. Now is
a good time to do it as we've got debian bullseye updates and python
minor releases. The bullseye updates fix a glibc bug that was affecting
Ansible in the zuul images. With this update we'll be able to remove the
workaround for that issue.
We also update the builder image's apt-get process to include a clean to
match the base image. This is more for consistency than anything else.
Finally update job timeouts for builds as it seems we occasionally need
more time particularly for emulated arm64 builds.
Change-Id: I31483ff434f19f408aef3b63cb2cd24044a8bf29
We must have missed this; I noticed when it didn't run on the gate job
for I949c40e9046008d4f442b322a267ce0c967a99dc
Change-Id: I62c5c0f262d9bd53580367dc9f1ad00fe7b6f6f2
We still have some Ubuntu Xenial servers, so cap the max usable pip
and setuptools versions in their venvs like we already do for
Bionic, in order to avoid broken installations. Switch our
conditionals from release name comparisons to version numbers in
order to more cleanly support ranges. Also make sure the borg run
test is triggered by changes to the create-venv role.
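The caps and conditional below illustrate the shape of the change
rather than the exact role contents (the version numbers and variable
names are assumptions):

  - name: Create venv with capped pip/setuptools on Xenial and older
    pip:
      name:
        - 'pip<21'          # illustrative caps; real values may differ
        - 'setuptools<51'
      virtualenv: "{{ venv_path }}"
    when: ansible_distribution_version is version('16.04', '<=')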
Change-Id: I5dd064c37786c47099bf2da66b907facb517c92a
Many of our tests are actually running with a timeout of 3600; I think
this came about through a combination of bumping timeouts for failures
and copy-pasting jobs.
We are seeing frequent timeouts of other jobs without this,
particularly on OVH GRA1. Let's bump the base timeout to 3600 to
account for this. The only job that overrides this now is gitea,
which runs for 4800 due to its long import process.
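In Zuul terms this is just the job timeout attribute, something like
the following (job names illustrative):

  - job:
      name: system-config-run
      timeout: 3600

  - job:
      name: system-config-run-gitea
      timeout: 4800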
Change-Id: I762f0f7c7a53a456d9269530c9ae5a9c85903c9c
Keeping the testing nodes at the other end of the namespace separates
them from production hosts. This one isn't really referencing itself
in testing like many others, but move it anyway.
Change-Id: I2130829a5f913f8c7ecd8b8dfd0a11da3ce245a9
Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the
certificate variables to the group definition file, so that we don't
have to duplicate handlers or definitions for the testing host.
Change-Id: I6650f5621a4969582f40700232a596d84e2b4a06
Currently we define the letsencrypt certs for each host in its
individual host variables.
With recent work we have a trusted CA and SAN names set up in
our testing environment, introducing the possibility that we could
accidentally reference the production host during testing (both have
valid certs, as far as the testing hosts are concerned).
To avoid this, we can use our naming scheme to move our testing hosts
to "99" and avoid collision with the production hosts. As a bonus,
this really makes you think more about your group/host split to get
things right and keep the environment as abstract as possible.
One example of this is that with letsencrypt certificates defined in
host vars, testing and production need to use the same hostname to get
the right certificates created. Really, this should be group-level
information so it applies equally to host01 and host99. To cover
"hostXX.opendev.org" as a SAN we can include the inventory_hostname in
the group variables.
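A hedged sketch of group-level certificate data of this kind (the
variable layout and file path are illustrative, not the repo's exact
contents):

  # inventory/service/group_vars/static.yaml (illustrative)
  letsencrypt_certs:
    static-main:
      - static.opendev.org
      - "{{ inventory_hostname }}"

This way both a host01 and a host99 request a certificate covering the
service name plus their own hostname.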
This updates one of the more tricky hosts, static, as a proof of
concept. We rename the handlers to be generic, and update the testing
targets.
Change-Id: Id98768e29a06cebaf645eb75b39e4dc5adb8830d
I've seen a couple of jobs timeout on this for no apparent reason.
Loading all the repos just seems to take a long time. Looking at the
logs [1], depending on the cloud taking 55m - 1h is not terribly
uncommon. Increase the timeout on this by 20 minutes to give it
enough headroom over an hour.
[1] https://zuul.opendev.org/t/openstack/builds?job_name=system-config-run-gitea&project=opendev%2Fsystem-config
Change-Id: I51080820bae35ac615a3b8b7ee1b8890e0df8410