This uncomments the list additions for the lists.airshipit.org and
lists.katacontainers.io sites on the new mailman server, removing
the configuration for them from the lists.opendev.org server and, in
the case of the latter, removing all our configuration management
for the server as it was the only site hosted there.
The 1.20 release is here. Upgrade to this version.
Things we change:
* Nodejs is updated to v20 to match the alpine 3.18 package version
that gitea switched to.
* Templates are updated to match upstream 1.20 templates.
* We drop the deprecated LFS_CONTENT_PATH from our server config and
add an equivalent [lfs] config section.
* Normalize app.ini content so that gitea won't write it back out to
disk which fails due to permissions (and we don't want it overriding
our configs anyway). For this we need to add WORK_PATH and
oauth2.JWT_SECRET, and normalize spacing and quoting for entries.
* Set JWT_SIGNING_PRIVATE_KEY_FILE explicitly to be located at
/data/gitea/jwt/private.pem otherwise gitea attempts to create the
jwt/ directory somewhere it doesn't have permissions to (I think /)
and won't be persisted across containers.
* Replace log.ENABLE_ACCESS_LOG with log.logger.access.MODE = file as
log.ENABLE_ACCESS_LOG is deprecated and doesn't appear to work
anymore. This appears to be a documentation issue or they deprecated
and removed things more quickly than originally anticipated.
* Add log.ACCESS_LOG_TEMPLATE to readd source port info to the access
* Add a templates/custom/header.tmpl file to set theme-color as the
config item for this has been removed.
The 1.20.0 changelog lists a number of breaking changes. I have
tried to capture them here as well as potential impacts to us:
* Fix WORK_DIR for docker (root) image (#25738) (#25811)
* We set APP_DATA_PATH to /data/gitea in our app.ini config which
means we aren't relying on the inferred value from WORK_DIR. I
think this isolates us from this change. But we can check for any
content in /app/gitea on our running containers to be sure.
Note we hardcode WORK_PATH to /data/gitea because gitea attempts to
write this back to our config file otherwise as a result of this
* Restrict [actions].DEFAULT_ACTIONS_URL to only github or self (#25581) (#25604)
* We disable actions. This shouldn't affect us.
* Refactor path & config system (#25330) (#25416)
* This is related to the first breaking change. Basically we need
to check our use of WORK_PATH and determine if we need to hardcode
it to something. Probably a good idea given how they keep changing
this on us...
* Fix all possible setting error related storages and added some tests (#23911) (#25244)
* We don't use storage configs. This shouldn't affect us.
* Use a separate admin page to show global stats, remove actions stat (#25062)
* The breaking change only affects the use of Prometheus which we
don't have yet.
* Remove the service worker (#25010)
* This is listed as a breaking change for UI cleanup that we don't need to
clean up. (ui.USE_SERVICE_WORKER can be removed).
* Remove meta tags theme-color and default-theme (#24960)
* Addressed by adding a custom templates/custom/header.tmpl file
that sets this meta tag to the existing value. Note this only
affects mobile clients so needs to be double checked via a mobile
* Use [git.config] for reflog cleaning up (#24958)
* Affects git.reflog config entries and we don't have any.
* Allow all URL schemes in Markdown links by default (#24805)
* TODO determine if we need to limit link types and add that
change if so. A point release was made to exclude bad types
already. Not sure if there are others we need to add.
* Redesign Scoped Access Tokens (#24767)
* This breaks scoped tokens with scopes that don't exist anymore.
I don't think we use scoped tokens.
* Fix team members API endpoint pagination (#24754)
* They 1 index the pagination of this endpoint now instead of 0
* Rewrite logger system (#24726)
* They made changes to the loggers and encourage people to check
their logs work as expected when upgrading. Using our test instance
logs I don't see anything that is a problem.
* Increase default LFS auth timeout from 20m to 24h (#24628)
* We don't use LFS but can change the timeout if necessary.
* Rewrite queue (#24505)
* Check for 'Removed queue option:' log entries and clean up
corresponding entries in app.ini. We don't have any of these
entries in our logs.
* Remove unused setting time.FORMAT (#24430)
* We didn't have this entry in app.ini.
* Refactor setting.Other and remove unused SHOW_FOOTER_BRANDING (#24270)
* This setting can be removed from app.ini, but we don't set it.
* Correct the access log format (#24085)
* We uncorrect it because they removed source port info in the
correction step. They did this because some log parsers don't
understand having the port info present, but if you are behind a
reverse proxy this information is very important. We run gitea behind
a reverse proxy.
* Reserve ".png" suffix for user/org names (#23992)
* .png is no longer a valid user/org name (it didn't work before
* Prefer native parser for SSH public key parsing (#23798)
* If you relied on the openssh ssh-keygen executable for public key
parsing then you must explicitly set config to use it. I don't
think we do as the golang native parser should handle the keytypes
* Editor preview support for external renderers (#23333)
* This removed an app.ini setting we don't seem to set.
* Add Gitea Profile Readmes (#23260)
* Readmes in .profile repositories will always be shown now. We don't
have .profile repos so this doesn't affect us.
* Refactor ctx in templates (#23105)
* This affects custom templates as we may need to replace ctx with
ctxData in our templates.
* I've searched our templates for 'root', 'ctx', and 'ctxData' and
have found no instances. Looking at the files modified by the
commits related to this change:
we don't seem to override the affected files. I think we are fine
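To illustrate the team members pagination change listed above: a client that
used to ask for page 0 to get the first page now has to start at page 1. This
is only a sketch. fetch_page is a hypothetical stand-in for the real HTTP
request and is not part of any Gitea client library, and fake_fetch simulates
a 1-indexed endpoint in memory.

```python
from typing import Callable, Iterator, List


def iter_members(fetch_page: Callable[[int, int], List[str]],
                 limit: int = 50) -> Iterator[str]:
    """Yield every entry from a 1-indexed paginated endpoint.

    fetch_page is a hypothetical callable taking (page, limit) and
    returning the entries on that page.
    """
    page = 1  # the first page is 1 now, not 0
    while True:
        entries = fetch_page(page, limit)
        if not entries:
            return
        yield from entries
        if len(entries) < limit:
            return  # a short page means we reached the end
        page += 1


def fake_fetch(page: int, limit: int) -> List[str]:
    """Simulate a 1-indexed endpoint backed by an in-memory list."""
    members = ["alice", "bob", "carol", "dave", "erin"]
    start = (page - 1) * limit
    return members[start:start + limit]
```

A caller that still starts at page 0 against such an endpoint would either get
an error or a duplicate of the first page, depending on how the server treats
the out-of-range value.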
The 1.20.1 changelog indicates there are no breaking changes, and git
diff shows no changes to the templates between 1.20.0 and 1.20.1.
The 1.20.2 changelog indicates there are no breaking changes, and git
diff shows no changes to the templates between 1.20.1 and 1.20.2.
The 1.20.3 changelog indicates there is a single breaking change:
* Fix the wrong derive path (#26271) (#26318)
* If I'm reading the code correctly, I think the problem was storage
configuration inheriting the base storage config and particularly
the related path. Then when archival storage looked for its config
the path was the root gitea storage path and it would inadvertently
delete all repos when deleting a single repo or something like
that. We don't use these features and these are mirrors anyway so I
don't think this really affects us.
The tsig_key value is a shared secret between the hidden-primary and
secondary servers to facilitate secure zone transfers. Thus we should
store it once in the common "adns" group, rather than duplicating it
in the adns-primary and adns-secondary groups.
This switches us to running the services against the etherpad group. We
also define vars in a group_vars file rather than a host specific
file. This allows us to switch testing over to etherpad99 to decouple it
from our production hostnames.
A followup change will add a new etherpad production server that will be
deployed alongside the existing one. This refactor makes that a bit
Firstly, my understanding of "adns" is that it's short for
authoritative-dns; i.e. things related to our main non-recursive DNS
servers for the zones we manage. The "a" is useful to distinguish
this from any sort of other dns services we might run for CI, etc.
The way we do this is with a "hidden" server that applies updates from
config management, which then notifies secondary public servers which
do a zone transfer from the primary. They're all "authoritative" in
the sense they're not for general recursive queries.
As mentioned in Ibd8063e92ad7ff9ee683dcc7dfcc115a0b19dcaa, we
currently have 3 groups
adns : the hidden primary bind server
ns : the secondary public authoritative servers
dns : both of the above
This proposes a refactor into the following 3 groups
adns-primary : hidden primary bind server
adns-secondary : the secondary public authoritative servers
adns : both of the above
This is meant to be a no-op; I just feel like this makes it a bit
clearer as to the "lay of the land" with these servers. It will need
some consideration of the hiera variables on bridge if we merge.
The mirror in our Limestone Networks donor environment is now
unreachable, but we ceased using this region years ago due to
persistent networking trouble and the admin hasn't been around for
roughly as long, so it's probably time to go ahead and say goodbye
This is just enough to get the cloud-launcher working on the new
Linaro cloud. It's a bit of a manual setup, and much newer hardware,
so trying to do things in small steps.
This should only be landed as part of our upgrade process. This change
will not upgrade Gerrit properly on its own.
Note, we keep Gerrit 3.5 image builds and 3.5 -> 3.6 upgrade jobs in
place until we are certain we won't roll back. Once we've crossed that
threshold we can drop 3.5 image builds, add 3.7 image builds, and update
the upgrade testing to perform a 3.6 -> 3.7 upgrade.
On the old bridge node we had some unmanaged venvs with a very old,
now unmaintained RAX DNS API interaction tool.
Adding the RDNS entries is fairly straightforward, and this small
tool is mostly a copy of some of the bits for our dns api backup tool.
It really just comes down to getting a token and making a post request
with the name/ip addresses.
When the cloud the node is launched in is identified as RAX, this will
automatically add the PTR records for the IPv4 and IPv6 addresses. It also
has an entrypoint to be called manually.
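As a rough sketch of the "name/ip addresses" part of that request, the helper
below builds one PTR entry per address. The dictionary keys are assumptions
made for illustration, not the confirmed RAX DNS API schema, and the token
handling and the POST itself are left out. Only the ipaddress module calls are
standard library behavior.

```python
import ipaddress
from typing import Dict, List


def ptr_records(fqdn: str, addresses: List[str],
                ttl: int = 3600) -> List[Dict[str, object]]:
    """Build one PTR entry per IPv4/IPv6 address of a server.

    The key names here are hypothetical; adjust them to whatever the
    DNS API actually expects.
    """
    records = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)  # raises ValueError on bad input
        records.append({
            "name": fqdn,
            "type": "PTR",
            "data": str(ip),
            "ttl": ttl,
            # Informational: the reverse-zone name this PTR lives under.
            "reverse_name": ip.reverse_pointer,
        })
    return records
```

Validating the addresses up front means a typo fails locally instead of
producing a rejected or, worse, accepted but wrong PTR request.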
This is added and hacked in, along with a config file for the
appropriate account (I have added these details on bridge).
I've left the update of openstack.org DNS entries as a manual
procedure. Although they could be set automatically with small
updates to the tool (just a different POST) -- details like CNAMES,
etc. and the relatively few servers we start in the RAX managed DNS
domains means I think it's easier to just do manually via the web ui.
The output comment is updated.
This replaces hard-coding of the host "bridge.openstack.org" with
hard-coding of the first (and only) host in the group "bastion".
The idea here is that we can, as much as possible, simply switch one
place to an alternative hostname for the bastion such as
"bridge.opendev.org" when we upgrade. This is just the testing path,
for now; a follow-on will modify the production path (which doesn't
really get speculatively tested)
This needs to be defined in two places:
1) We need to define this in the run jobs for Zuul to use in the
playbooks/zuul/run-*.yaml playbooks, as it sets up and collects
logs from the testing bastion host.
2) The nested Ansible run will then use inventory
Various other places are updated to use this abstracted group as the
Variables are moved into the bastion group (which only has one host --
the actual bastion host) which means we only have to update the group
mapping to the new host.
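The "first (and only) host in the group" lookup can be modelled as below. This
is a plain Python illustration of the idea, not the playbook code: the groups
mapping is a hypothetical stand-in for the inventory, and in Ansible the
equivalent would be roughly an expression such as groups['bastion'][0].

```python
from typing import Dict, List


def bastion_host(groups: Dict[str, List[str]]) -> str:
    """Return the single host in the 'bastion' group.

    groups is a hypothetical inventory mapping of group name to hosts.
    """
    hosts = groups.get("bastion", [])
    if len(hosts) != 1:
        raise ValueError(
            f"expected exactly one bastion host, found {len(hosts)}")
    return hosts[0]
```

Switching the bastion then means changing only the group membership, for
example from ["bridge.openstack.org"] to ["bridge.opendev.org"], while callers
stay the same.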
This is intended to be a no-op change; all the jobs should work the
same, but just using the new abstractions.
As a short history diversion, at one point we tried building
diskimage-builder based images for upload to our control-plane
(instead of using upstream generic cloud images). This didn't really
work because the long-lived production servers led to leaking images
and nodepool wasn't really meant to deal with this lifecycle.
Before this the only thing that needed credentials for the
control-plane clouds was bridge.
Id1161bca8f23129202599dba299c288a6aa29212 reworked things to have a
control-plane-clouds group which would have access to the credential
So at this point we added
zuul/templates/group_vars/control-plane-clouds.yaml.j2 with stub
variables for testing.
However, we also have the same cloud: variable with stub variables in
zuul/templates/host_vars/bridge.openstack.org.yaml.j2. This is
overriding the version from control-plane-clouds because it is more
specific (host variable). Over time this has skewed from the
control-plane-clouds definition, but I think we have not noticed
because we are not updating the control-plane clouds on the non-bridge
(nodepool) nodes any more.
This is a long way of saying remove the bridge-specific definitions,
and just keep the stub variables in the control-plane-clouds group.
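The override described above can be shown with a small model of variable
precedence. This is a simplified sketch and not how Ansible is implemented. It
assumes the default behavior where a host-level variable replaces a group-level
variable of the same name wholesale rather than merging with it, and the
variable contents are made-up stubs.

```python
from typing import Any, Dict


def effective_vars(group_vars: Dict[str, Any],
                   host_vars: Dict[str, Any]) -> Dict[str, Any]:
    """Combine group and host variables, host level winning.

    A key present in host_vars replaces the group value entirely;
    nested dictionaries are not merged.
    """
    merged = dict(group_vars)
    merged.update(host_vars)
    return merged


# Made-up stub values for illustration only.
group_stub = {"clouds": {"region": "group-stub", "extra": "only-in-group"}}
host_stub = {"clouds": {"region": "host-stub"}}
```

With both definitions present the host value wins and "extra" silently
disappears, which is how the two copies could drift apart unnoticed. Removing
the host-level definition leaves the group value in effect.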
We are currently running an all in one jitsi meet service at
meetpad.opendev.org due to connectivity issues for colibri websockets to
the jvb servers. Before we open these up we need to configure the http
server for websockets on the jvbs to do tls as they are on different
Note it isn't entirely clear yet if a randomly generated keystore is
sufficient for the needs of the jvb colibri websocket system. If not we
may need to convert an LE provisioned cert and key pair into a keystore.
Keeping the testing nodes at the other end of the namespace separates
them from production hosts. This one isn't really referencing itself
in testing like many others, but move it anyway.
Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the
certificate variables to the group definition file, so that we don't
have to duplicate handlers or definitions for the testing host.
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
set up now, we can directly test against "paste99.opendev.org",
removing the insecure flags to various calls.
To make testing more like production, copy the OpenDev CA into the
haproxy container configuration directory during Zuul runs. We then
update the testing configuration to use SSL checking like production
does with this cert.
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
The other thing this does is rework the letsencrypt self-signed cert
path to correctly set up SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to override in testing to "localhost" because that
is all the old certificate covered; we can now just proxy to the
hostname directly for testing and production.
We have moved to a situation where we proxy requests to gitea (3000)
via Apache listening on 3081 -- this is useful for layer 7 filtering
like matching on user-agents.
It seems like we missed some of this configuration in our
load-balancer testing. Update the https forward on the load-balancer
to port 3081 on the gitea test host.
Also, remove the explicit port opening in the testing group_vars; for
some reason this was not opening port 3080 (http). This will just use
the production settings when we don't override it.
We previously auto updated nodepool builders but not launchers when new
container images were present. This created confusion over what versions
of nodepool opendev is running. Use the same behavior for both services
now and auto restart them both.
There is a small chance that we can pull in an update that breaks things
so we run serially to avoid the most egregious instances of this
As found in Ie5d55b2a2d96a78b34d23cc6fbac62900a23fc37, the default for
this is to issue "OPTIONS /" which is kind of a weird request. The
Zuul hosts currently seem to return the main page content in response
to an OPTIONS request, which probably isn't right.
Make this more robust by just using a "HEAD /" request.
Apparently the check-ssl option only modifies check behavior, but
does not actually turn it on. The check option also needs to be set
in order to activate checks of the server. See §5.2 of the haproxy
docs for details:
Turn it on for all of our balance_zuul_https server entries.
Also set this on the gitea01 server entry in balance_git_https, so
we can make sure it's still seen as "up" once this change takes
effect. A follow-up change will turn it on for the other
balance_git_https servers out of an abundance of caution around that
Switch the port 80 and 443 endpoints over to doing http checks instead
of tcp checks. This ensures that both apache and the zuul-web backend
are functional before balancing to them.
The fingergw remains a tcp check.
Previously we were only checking that Apache can open TCP connections to
determine if Gitea is up or down on a backend. This is insufficient
because Gitea itself may be down while Apache is up. In this situation
TCP connection to Apache will function, but if we make an HTTP request
we should get back an error.
To check if both Apache and Gitea are working properly we switch to
using http checks instead. Then if Gitea is down Apache can return a 500
and the Gitea backend will be removed from the pool. Similarly if Apache
is non-functional the check will fail to connect via TCP.
Note we don't verify ssl certs for simplicity as checking these in
testing is not straightforward. We didn't have verification with the old
tcp checks so this isn't a regression, but does represent something we
could try and improve in the future.
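The difference between the two kinds of check can be reproduced locally. The
sketch below is not haproxy's implementation. It assumes a HEAD / request and
treats 2xx and 3xx responses as healthy, and start_fake_backend is a
hypothetical throwaway server standing in for Apache in front of Gitea.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Pass if a TCP connection can be opened (the old behavior)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Pass only if HEAD / answers with a 2xx or 3xx status."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("HEAD", "/")
        status = conn.getresponse().status
        conn.close()
    except (OSError, http.client.HTTPException):
        return False
    return 200 <= status < 400


def start_fake_backend(status: int) -> HTTPServer:
    """Start a local server answering every request with `status`."""

    class Handler(BaseHTTPRequestHandler):
        def _reply(self):
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        do_GET = _reply
        do_HEAD = _reply

        def log_message(self, *args):
            pass  # keep the example quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A backend started with start_fake_backend(500) still passes tcp_check but
fails http_check, which is the case the TCP-only checks could not detect.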
The actual upgrade will be performed manually, but this change will be
used to update the docker-compose.yaml file.
If we land this change prior to the upgrade then note the
manage-projects commands will be updated to use the 3.4 image possibly
while gerrit 3.3 is still running. I don't expect this to be a problem
as manage-projects operates via network protocols.
It appears that simply setting stdin to an empty string is
insufficient to make newlist calls from Ansible correctly look like
they're coming from a non-interactive shell. As it turns out, newer
versions of the command include a -a (--automate) option which does
exactly what we want: sends list admin notifications on creation
without prompting for manual confirmation.
Drop the test-time addition of -q to quell listadmin notifications,
as we now block outbound 25/tcp from nodes in our deploy tests. This
has repeatedly exposed a testing gap, where the behavior in
production was broken because of newlist processes hanging awaiting
user input even though we never experienced it in testing due to the
-q addition there.
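A non-interactive invocation along these lines can be sketched as follows.
This is an illustration in Python rather than the Ansible task itself. The
positional argument order (list name, admin address, admin password) is an
assumption about the newlist usage, and only the -a option comes from the
description above.

```python
import subprocess
from typing import List


def newlist_argv(listname: str, admin_addr: str,
                 admin_password: str) -> List[str]:
    """Build a newlist command line that should not prompt.

    -a (--automate) sends the list admin notification without asking
    for confirmation. The positional order is an assumption.
    """
    return ["newlist", "-a", listname, admin_addr, admin_password]


def create_list(listname: str, admin_addr: str,
                admin_password: str) -> None:
    """Run newlist with stdin detached so any prompt cannot hang."""
    subprocess.run(
        newlist_argv(listname, admin_addr, admin_password),
        stdin=subprocess.DEVNULL,
        check=True,
    )
```

Using the same arguments in testing and production, without a test-only -q,
keeps the tested code path identical to the deployed one.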
Our deployment tests don't need to send E-mail messages. More to the
point, they may perform actions which would like to send E-mail
messages. Make sure, at the network level, they'll be prevented from
doing so. Also allow all connections to egress from the loopback
interface, so that services like mailman can connect to the Exim MTA
Add new rolevars for egress rules to support this, and also fix up
some missing related vars in the iptables role's documentation.
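The intended egress policy can be sketched as a small rule generator. This is
not the iptables role's template or its variable names. The REJECT target and
the rule layout are assumptions, and only the policy itself (allow loopback
egress, block outbound 25/tcp) comes from the description above.

```python
from typing import List


def egress_rules(blocked_tcp_ports: List[int]) -> List[str]:
    """Render OUTPUT chain rules for an iptables rules file.

    Loopback egress is accepted first so local services can still
    reach a local MTA; the listed TCP ports are then rejected.
    """
    rules = ["-A OUTPUT -o lo -j ACCEPT"]
    for port in blocked_tcp_ports:
        rules.append(f"-A OUTPUT -p tcp --dport {port} -j REJECT")
    return rules
```

Rule order matters here: if the port 25 rule came before the loopback rule,
connections to an MTA listening on localhost would be rejected as well.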