Now that the Mailman v3 migration is complete, we no longer need any
divergence between the lists01 (production) and lists99 (test node)
host vars, so put everything into the group vars file instead.
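The pattern, sketched (paths and variable name illustrative, not the
exact keys involved):

    # inventory/service/host_vars/lists01.opendev.org.yaml  (removed)
    # inventory/service/host_vars/lists99.opendev.org.yaml  (removed)
    # inventory/service/group_vars/mailman3.yaml            (keeps it all)
    mailman_site_domain: lists.opendev.org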
Change-Id: If92943694e95ef261fbd254eff65a51d8d3f7ce5
This uncomments the list additions for the lists.airshipit.org and
lists.katacontainers.io sites on the new mailman server, removing
the configuration for them from the lists.opendev.org server and, in
the case of the latter, removing all our configuration management
for the server as it was the only site hosted there.
Change-Id: Ic1c735469583e922313797f709182f960e691efc
The 1.20 release is here. Upgrade to this version.
Things we change:
* Nodejs is updated to v20 to match the alpine 3.18 package version
that gitea switched to.
* Templates are updated to match upstream 1.20 templates.
* We drop the deprecated LFS_CONTENT_PATH from our server config and
add an equivalent [lfs] config section.
* Normalize app.ini content so that gitea won't write it back out to
  disk, which fails due to permissions (and we don't want it overriding
  our configs anyway). For this we need to add WORK_PATH and
  oauth2.JWT_SECRET, and normalize spacing and quoting for entries. (A
  sketch of the resulting app.ini entries follows this list.)
* Set JWT_SIGNING_PRIVATE_KEY_FILE explicitly to be located at
  /data/gitea/jwt/private.pem; otherwise gitea attempts to create the
  jwt/ directory somewhere it doesn't have permissions to (I think /),
  and the key won't be persisted across containers.
* Replace log.ENABLE_ACCESS_LOG with log.logger.access.MODE = file, as
  log.ENABLE_ACCESS_LOG is deprecated and doesn't appear to work
  anymore. This appears to be a documentation issue, or they deprecated
  and removed things more quickly than originally anticipated.
* Add log.ACCESS_LOG_TEMPLATE to re-add source port info to the access
  logs.
* Add a templates/custom/header.tmpl file to set theme-color as the
config item for this has been removed.
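As referenced above, a sketch of the resulting app.ini entries (values
illustrative, not our exact production settings):

    WORK_PATH = /data/gitea

    [lfs]
    PATH = /data/git/lfs

    [oauth2]
    JWT_SECRET = <base64 secret>
    JWT_SIGNING_PRIVATE_KEY_FILE = /data/gitea/jwt/private.pem

    [log]
    logger.access.MODE = file
    ; ACCESS_LOG_TEMPLATE = <default format plus the source port>

And templates/custom/header.tmpl is just the one meta tag (color value
illustrative):

    <meta name="theme-color" content="#6cc644">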
The 1.20.0 changelog [0] lists a number of breaking changes. I have
tried to capture them here, as well as potential impacts to us:
* Fix WORK_DIR for docker (root) image (#25738) (#25811)
* We set APP_DATA_PATH to /data/gitea in our app.ini config which
  means we aren't relying on the inferred value from WORK_DIR. I
  think this isolates us from this change, but we can check for any
  content in /app/gitea on our running containers to be sure.
  Note we hardcode WORK_PATH to /data/gitea because otherwise, as a
  result of this change, gitea attempts to write it back to our
  config file.
* Restrict [actions].DEFAULT_ACTIONS_URL to only github or self (#25581) (#25604)
* We disable actions. This shouldn't affect us.
* Refactor path & config system (#25330) (#25416)
* This is related to the first breaking change. Basically we need
to check our use of WORK_PATH and determine if we need to hardcode
it to something. Probably a good idea given how they keep changing
this on us...
* Fix all possible setting error related storages and added some tests (#23911) (#25244)
* We don't use storage configs. This shouldn't affect us.
* Use a separate admin page to show global stats, remove actions stat (#25062)
* The breaking change only affects the use of Prometheus which we
don't have yet.
* Remove the service worker (#25010)
* Listed as a breaking change for UI cleanup that we don't need to
  perform (ui.USE_SERVICE_WORKER can be removed).
* Remove meta tags theme-color and default-theme (#24960)
* https://github.com/go-gitea/gitea/pull/24960
* Addressed by adding a custom templates/custom/header.tmpl file
that sets this meta tag to the existing value. Note this only
affects mobile clients so needs to be double checked via a mobile
device.
* Use [git.config] for reflog cleaning up (#24958)
* Affects git.reflog config entries, and we don't have any.
* Allow all URL schemes in Markdown links by default (#24805)
* TODO determine if we need to limit link types and add that
change if so. A point release was made to exclude bad types
already. Not sure if there are others we need to add.
* Redesign Scoped Access Tokens (#24767)
* This breaks scoped tokens with scopes that don't exist anymore.
I don't think we use scoped tokens.
* Fix team members API endpoint pagination (#24754)
* They now 1-index the pagination of this endpoint instead of
  0-indexing it.
* Rewrite logger system (#24726)
* They made changes to the loggers and encourage people to check
their logs work as expected when upgrading. Using our test instance
logs I don't see anything that is a problem.
* Increase default LFS auth timeout from 20m to 24h (#24628)
* We don't use LFS but can change the timeout if necessary.
* Rewrite queue (#24505)
* Check for 'Removed queue option:' log entries and clean up
corresponding entries in app.ini. We don't have any of these
entries in our logs.
* Remove unused setting time.FORMAT (#24430)
* We didn't have this entry in app.ini.
* Refactor setting.Other and remove unused SHOW_FOOTER_BRANDING (#24270)
* This setting can be removed from app.ini, but we don't set it.
* Correct the access log format (#24085)
* We uncorrect it, because they removed source port info in the
  correction step. They did this because some log parsers don't
  understand having the port info present, but if you are behind a
  reverse proxy this information is very important, and we run gitea
  behind a reverse proxy.
* Reserve ".png" suffix for user/org names (#23992)
* .png is no longer a valid user/org name (it didn't work before
anyway).
* Prefer native parser for SSH public key parsing (#23798)
* If you relied on the openssh ssh-keygen executable for public key
parsing then you must explicitly set config to use it. I don't
think we do as the golang native parser should handle the keytypes
we use.
* Editor preview support for external renderers (#23333)
* This removed an app.ini setting we don't seem to set.
* Add Gitea Profile Readmes (#23260)
* Readmes in .profile repositories will always be shown now. We don't
  have .profile repos so this doesn't affect us.
* Refactor ctx in templates (#23105)
* This affects custom templates as we may need to replace ctx with
ctxData in our templates.
* I've searched our templates for 'root', 'ctx', and 'ctxData' and
  have found no instances. Looking at the files modified by the
  commits related to this change:
  bd7f218dce7c01260e1d
  we don't seem to override the affected files. I think we are fine
  as is.
The 1.20.1 changelog indicates there are no breaking changes, and git
diff shows no changes to the templates between 1.20.0 and 1.20.1.
The 1.20.2 changelog indicates there are no breaking changes, and git
diff shows no changes to the templates between 1.20.1 and 1.20.2.
The 1.20.3 changelog indicates there is a single breaking change:
* Fix the wrong derive path (#26271) (#26318)
* If I'm reading the code correctly, I think the problem was storage
  configuration inheriting the base storage config, and particularly
  the related path. Then when archival storage looked for its config,
  the path was the root gitea storage path, and it would inadvertently
  delete all repos when deleting a single repo, or something like
  that. We don't use these features, and these are mirrors anyway, so
  I don't think this really affects us.
[0] https://github.com/go-gitea/gitea/blob/v1.20.3/CHANGELOG.md
Change-Id: I265f0ad16c0e757a11c1d889996ffe2198625a1a
The tsig_key value is a shared secret between the hidden-primary and
secondary servers to facilitate secure zone transfers. Thus we should
store it once in the common "adns" group, rather than duplicating it
in the adns-primary and adns-secondary groups.
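For reference, the key is ultimately consumed by bind configuration
along these lines (a generic named.conf sketch; key name and algorithm
illustrative):

    key "opendev-tsig" {
        algorithm hmac-sha256;
        secret "<base64 shared secret>";
    };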
Change-Id: I600f1ecdfc06bda79b6a4ce77253f489ad515fa5
This switches us to running the services against the etherpad group. We
also define vars in a group_vars file rather than a host specific
file. This allows us to switch testing over to etherpad99 to decouple it
from our production hostnames.
A followup change will add a new etherpad production server that will be
deployed alongside the existing one. This refactor makes that a bit
simpler.
Change-Id: I838ad31eb74a3abfd02bbfa77c9c2d007d57a3d4
This updates our base config to 3.7. This should only be merged as
part of the update process described at
https://etherpad.opendev.org/p/gerrit-upgrade-3.7
Change-Id: I9a1fc4a9f35ed0f60b9899cb9d08aa81995e640b
Firstly, my understanding of "adns" is that it's short for
authoritative-dns; i.e. things related to our main non-recursive DNS
servers for the zones we manage. The "a" is useful to distinguish
this from any sort of other dns services we might run for CI, etc.
The way we do this is with a "hidden" server that applies updates from
config management, which then notifies secondary public servers which
do a zone transfer from the primary. They're all "authoritative" in
the sense they're not for general recursive queries.
As mentioned in Ibd8063e92ad7ff9ee683dcc7dfcc115a0b19dcaa, we
currently have 3 groups:
adns : the hidden primary bind server
ns : the secondary public authoritative servers
dns : both of the above
This proposes a refactor into the following 3 groups:
adns-primary : hidden primary bind server
adns-secondary : the secondary public authoritative servers
adns : both of the above
This is meant to be a no-op; I just feel like this makes it a bit
clearer as to the "lay of the land" with these servers. It will need
some considering of the hiera variables on bridge if we merge.
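Sketched in Ansible inventory terms (host patterns illustrative):

    adns-primary:
      hosts:
        adns*.opendev.org:
    adns-secondary:
      hosts:
        ns*.opendev.org:
    adns:
      children:
        adns-primary:
        adns-secondary: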
Change-Id: I9ffef52f27bd23ceeec07fe0f45f9fee08b5559a
The last iteration of this donor environment was taken down at the
end of 2022, let's proceed with final config removal for it.
Change-Id: Icfa9a681f052f69d96fd76c6038a6cd8784d9d8d
We haven't used the Packethost donor environment in a very long
time, go ahead and clean up lingering references to it in our
configuration.
Change-Id: I870f667d10cc38de3ee16be333665ccd9fe396b9
The mirror in our Limestone Networks donor environment is now
unreachable, but we ceased using this region years ago due to
persistent networking trouble and the admin hasn't been around for
roughly as long, so it's probably time to go ahead and say goodbye
to it.
Change-Id: Ibad440a3e9e5c210c70c14a34bcfec1fb24e07ce
These dummy variables were for the nodepool.yaml template during
testing, but are no longer referenced. Clean them up.
Change-Id: I717ab8f9b980b363fdddaa28e76cd269b1e4d876
This is just enough to get the cloud-launcher working on the new
Linaro cloud. It's a bit of a manual setup, and much newer hardware,
so we're trying to do things in small steps.
Change-Id: Ibd451e80bbc6ba6526ba9470ac48b99a981c1a8d
This should only be landed as part of our upgrade process. This change
will not upgrade Gerrit properly on its own.
Note, we keep Gerrit 3.5 image builds and 3.5 -> 3.6 upgrade jobs in
place until we are certain we won't roll back. Once we've crossed that
threshold we can drop 3.5 image builds, add 3.7 image builds, and update
the upgrade testing to perform a 3.6 -> 3.7 upgrade.
Change-Id: I40c4f96cc40edc5caeb32a1af80069ef784967fd
On the old bridge node we had some unmanaged venvs with a very old,
now unmaintained RAX DNS API interaction tool.
Adding the RDNS entries is fairly straightforward, and this small
tool is mostly a copy of some of the bits from our dns api backup tool.
It really just comes down to getting a token and making a post request
with the name/ip addresses.
When the cloud a node is launched in is identified as RAX, this will
automatically add the PTR records for its IPv4 & IPv6 addresses. It
also has an entrypoint to be called manually.
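The core of the tool is roughly the following (a minimal sketch, not
the actual script; request layouts follow the Rackspace identity and
Cloud DNS API docs, and all names and arguments are illustrative):

    import requests

    def get_token(username, api_key):
        # Exchange a username/API key pair for an identity token.
        r = requests.post(
            "https://identity.api.rackspacecloud.com/v2.0/tokens",
            json={"auth": {"RAX-KSKEY:apiKeyCredentials": {
                "username": username, "apiKey": api_key}}})
        r.raise_for_status()
        return r.json()["access"]["token"]["id"]

    def add_ptr(token, tenant_id, server_href, address, fqdn):
        # POST a PTR record linked to the server owning the address.
        r = requests.post(
            "https://dns.api.rackspacecloud.com/v1.0/%s/rdns" % tenant_id,
            headers={"X-Auth-Token": token},
            json={"recordsList": {"records": [
                      {"type": "PTR", "name": fqdn, "data": address}]},
                  "link": {"href": server_href,
                           "rel": "cloudServersOpenStack",
                           "content": ""}})
        r.raise_for_status()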
This is added and hacked in, along with a config file for the
appropriate account (I have added these details on bridge).
I've left the update of openstack.org DNS entries as a manual
procedure. Although they could be set automatically with small
updates to the tool (just a different POST) -- details like CNAMES,
etc. and the relatively few servers we start in the RAX managed DNS
domains means I think it's easier to just do it manually via the web
ui.
The output comment is updated.
Change-Id: I8a42afdd00be2595ca73819610757ce5d4435d0a
The dependent change allows us to also post to mastodon. Configure
this to point to fosstodon where we have an opendevinfra account.
Change-Id: Iafa8074a439315f3db74b6372c1c3181a159a474
Depends-On: https://review.opendev.org/c/opendev/statusbot/+/864586
This replaces hard-coding of the host "bridge.openstack.org" with
hard-coding of the first (and only) host in the group "bastion".
The idea here is that we can, as much as possible, simply switch one
place to an alternative hostname for the bastion such as
"bridge.opendev.org" when we upgrade. This is just the testing path,
for now; a follow-on will modify the production path (which doesn't
really get speculatively tested).
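In Ansible terms this mostly means using the group subscript pattern
(or the groups magic variable) instead of a literal hostname; a
minimal sketch (variable name hypothetical):

    # playbook: target the first (and only) member of the group
    - hosts: bastion[0]
      tasks: []

    # in templates/vars: the same host as a value
    bastion_name: "{{ groups['bastion'][0] }}"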
This needs to be defined in two places:
1) We need to define this in the run jobs for Zuul to use in the
playbooks/zuul/run-*.yaml playbooks, as it sets up and collects
logs from the testing bastion host.
2) The nested Ansible run will then use the inventory in
   inventory/service/groups.yaml
Various other places are updated to use this abstracted group as the
bastion host.
Variables are moved into the bastion group (which only has one host --
the actual bastion host) which means we only have to update the group
mapping to the new host.
This is intended to be a no-op change; all the jobs should work the
same, but just using the new abstractions.
Change-Id: Iffb462371939989b03e5d6ac6c5df63aa7708513
As a short history diversion: at one point we tried building
diskimage-builder based images for upload to our control-plane
(instead of using upstream generic cloud images). This didn't really
work, because the long-lived production servers led to leaking images
and nodepool wasn't really meant to deal with this lifecycle.
Before this the only thing that needed credentials for the
control-plane clouds was bridge.
Id1161bca8f23129202599dba299c288a6aa29212 reworked things to have a
control-plane-clouds group which would have access to the credential
variables.
So at this point we added
zuul/templates/group_vars/control-plane-clouds.yaml.j2 with stub
variables for testing.
However, we also have the same cloud: variable with stub variables in
zuul/templates/host_vars/bridge.openstack.org.yaml.j2. This is
overriding the version from control-plane-clouds because it is more
specific (host variable). Over time this has skewed from the
control-plane-clouds definition, but I think we have not noticed
because we are not updating the control-plane clouds on the non-bridge
(nodepool) nodes any more.
This is a long way of saying remove the bridge-specific definitions,
and just keep the stub variables in the control-plane-clouds group.
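A condensed illustration of the precedence problem (variable bodies
elided):

    # zuul/templates/group_vars/control-plane-clouds.yaml.j2
    clouds: {}   # stub values for testing, applied to all group members

    # zuul/templates/host_vars/bridge.openstack.org.yaml.j2 (removed)
    clouds: {}   # older stubs; host vars shadow group vars, so these won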
Change-Id: I6c1bfe7fdca27d6e34d9691099b0e1c6d30bb967
We are currently running an all in one jitsi meet service at
meetpad.opendev.org due to connectivity issues for colibri websockets to
the jvb servers. Before we open these up we need to configure the http
server for websockets on the jvbs to do tls as they are on different
hosts.
Note it isn't entirely clear yet if a randomly generated keystore is
sufficient for the needs of the jvb colibri websocket system. If not we
may need to convert an LE provisioned cert and key pair into a keystore.
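If we do need to convert, it would look roughly like this (a sketch;
file names, alias and passwords illustrative):

    openssl pkcs12 -export -in fullchain.pem -inkey privkey.pem \
        -name jvb -out jvb.p12 -passout pass:changeit
    keytool -importkeystore -srckeystore jvb.p12 -srcstoretype pkcs12 \
        -srcstorepass changeit -destkeystore jvb.jks \
        -deststorepass changeit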
Change-Id: Ifbca19f1c112e30ee45975112863fc808db39fc9
Keeping the testing nodes at the other end of the namespace separates
them from production hosts. This one isn't really referencing itself
in testing like many others, but move it anyway.
Change-Id: I2130829a5f913f8c7ecd8b8dfd0a11da3ce245a9
Similar to Id98768e29a06cebaf645eb75b39e4dc5adb8830d, move the
certificate variables to the group definition file, so that we don't
have to duplicate handlers or definitions for the testing host.
Change-Id: I6650f5621a4969582f40700232a596d84e2b4a06
Move the paste testing server to paste99 to distinguish it in testing
from the actual production paste service. Since we have certificates
setup now, we can directly test against "paste99.opendev.org",
removing the insecure flags to various calls.
Change-Id: Ifd5e270604102806736dffa86dff2bf8b23799c5
To make testing more like production, copy the OpenDev CA into the
haproxy container configuration directory during Zuul runs. We then
update the testing configuration to use SSL checking like production
does with this cert.
Change-Id: I1292bc1aa4948c8120dada0f0fd7dfc7ca619afd
Some of our testing makes use of secure communication between testing
nodes; e.g. testing a load-balancer pass-through. Other parts
"loop-back" but require flags like "curl --insecure" because the
self-signed certificates aren't trusted.
To make testing more realistic, create a CA that is distributed and
trusted by all testing nodes early in the Zuul playbook. This then
allows us to sign local certificates created by the letsencrypt
playbooks with this trusted CA and have realistic peer-to-peer secure
communications.
The other thing this does is reworks the letsencrypt self-signed cert
path to correctly setup SAN records for the host. This also improves
the "realism" of our testing environment. This is so realistic that
it requires fixing the gitea playbook :). The Apache service proxying
gitea currently has to override the backend in testing to "localhost",
because that is all the old certificate covered; we can now just proxy
to the hostname directly for both testing and production.
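For reference, signing a host certificate that carries a SAN with a
local CA looks roughly like this (bash sketch; hostname, file names
and lifetime illustrative):

    openssl req -new -newkey rsa:2048 -nodes -keyout host.key \
        -subj "/CN=gitea99.opendev.org" -out host.csr
    openssl x509 -req -in host.csr -CA ca.crt -CAkey ca.key \
        -CAcreateserial -days 30 -out host.crt \
        -extfile <(printf "subjectAltName=DNS:gitea99.opendev.org")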
Change-Id: I3d49a7b683462a076263127018ec6a0f16735c94
We have moved to a situation where we proxy requests to gitea (3000)
via Apache listening on 3081 -- this is useful for layer 7 filtering
like matching on user-agents.
It seems like we missed some of this configuration in our
load-balancer testing. Update the https forward on the load-balancer
to port 3081 on the gitea test host.
Also, remove the explicit port opening in the testing group_vars; for
some reason this was not opening port 3080 (http). This will just use
the production settings when we don't override it.
Change-Id: Ic5690ed893b909a7e6b4074a1e5cd71ab0683ab4
We previously auto updated nodepool builders but not launchers when new
container images were present. This created confusion over what versions
of nodepool opendev is running. Use the same behavior for both services
now and auto restart them both.
There is a small chance that we can pull in an update that breaks things
so we run serially to avoid the most egregious instances of this
scenario.
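The restart behavior amounts to something like the following (a
sketch, not our actual playbook; group name and path illustrative):

    - hosts: nodepool-launcher
      serial: 1   # update one launcher at a time
      tasks:
        - name: Pull the current image and restart on change
          shell: docker-compose pull && docker-compose up -d
          args:
            chdir: /etc/nodepool-launcher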
Change-Id: Ifc3ca375553527f9a72e4bb1bdb617523a3f269e
This updates the gerrit configuration to deploy 3.5 in production.
For details of the upgrade process see:
https://etherpad.opendev.org/p/gerrit-upgrade-3.5
Change-Id: I50c9c444ef9f798c97e5ba3dd426cc4d1f9446c1
As found in Ie5d55b2a2d96a78b34d23cc6fbac62900a23fc37, the default for
this is to issue "OPTIONS /" which is kind of a weird request. The
Zuul hosts currently seem to return the main page content in response
to an OPTIONS request, which probably isn't right.
Make this more robust by just using "HEAD /" request.
Change-Id: Ibbd32ae744af9c33aedd087a8146195844814b3f
Apparently the check-ssl option only modifies check behavior, but
does not actually turn it on. The check option also needs to be set
in order to activate checks of the server. See §5.2 of the haproxy
docs for details:
https://git.haproxy.org/?p=haproxy-2.5.git;a=blob;f=doc/configuration.txt;h=e3949d1eebe171920c451b4cad1d5fcd07d0bfb5;hb=HEAD#l14396
Turn it on for all of our balance_zuul_https server entries.
Also set this on the gitea01 server entry in balance_git_https, so
we can make sure it's still seen as "up" once this change takes
effect. A follow-up change will turn it on for the other
balance_git_https servers out of an abundance of caution around that
service.
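The difference on a server line (addresses illustrative):

    # before: check-ssl alone only sets check parameters; no checks run
    server zuul01 192.0.2.10:443 check-ssl
    # after: "check" actually enables the health checks
    server zuul01 192.0.2.10:443 check check-ssl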
Change-Id: I4018507f6e0ee1b5c30139de301e09b3ec6fc494
Switch the port 80 and 443 endpoints over to doing http checks instead
of tcp checks. This ensures that both apache and the zuul-web backend
are functional before balancing to them.
The fingergw remains a tcp check.
Change-Id: Iabe2d7822c9ef7e4514b9a0eb627f15b93ad48e2
Previously we were only checking that Apache can open TCP connections to
determine if Gitea is up or down on a backend. This is insufficient
because Gitea itself may be down while Apache is up. In this situation
TCP connections to Apache will succeed, but if we make an HTTP request
we should get back an error.
To check if both Apache and Gitea are working properly we switch to
using http checks instead. Then if Gitea is down Apache can return a 500
and the Gitea backend will be removed from the pool. Similarly if Apache
is non functional the check will fail to connect via TCP.
Note we don't verify ssl certs for simplicity as checking these in
testing is not straightforward. We didn't have verification with the old
tcp checks so this isn't a regression, but it does represent something
we could try to improve in the future.
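A gitea backend server line thus ends up along these lines (address
illustrative):

    server gitea01 192.0.2.20:3081 check check-ssl verify none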
Change-Id: Id47a1f9028c7575e8fbbd10fabfc9730095cb541
The sql connection is no longer supported, we need to use "database"
instead. The corresponding hostvars change has already been made
on bridge.
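The zuul.conf change is essentially (a sketch; connection name and
dburi illustrative):

    # old, no longer supported:
    [connection mydatabase]
    driver=sql
    dburi=mysql+pymysql://zuul:secret@dbhost/zuul

    # new:
    [database]
    dburi=mysql+pymysql://zuul:secret@dbhost/zuul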
Change-Id: Ibcac56568f263bd50b2be43baa26c8c514c5272b
The actual upgrade will be performed manually, but this change will be
used to update the docker-compose.yaml file.
If we land this change prior to the upgrade, note that the
manage-projects commands will be updated to use the 3.4 image,
possibly while Gerrit 3.3 is still running. I don't expect this to be
a problem as manage-projects operates via network protocols.
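The change itself is just the image tag bump in docker-compose.yaml
(image reference illustrative):

    services:
      gerrit:
        image: docker.io/opendevorg/gerrit:3.4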
Change-Id: I5775f4518ec48ac984b70820ebd2e645213e702a
It appears that simply setting stdin to an empty string is
insufficient to make newlist calls from Ansible correctly look like
they're coming from a non-interactive shell. As it turns out, newer
versions of the command include a -a (--automate) option which does
exactly what we want: sends list admin notifications on creation
without prompting for manual confirmation.
Drop the test-time addition of -q to quell listadmin notifications,
as we now block outbound 25/tcp from nodes in our deploy tests. This
has repeatedly exposed a testing gap, where the behavior in
production was broken because of newlist processes hanging awaiting
user input even though we never experienced it in testing due to the
-q addition there.
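The resulting invocation looks roughly like (list name, owner and
password illustrative):

    newlist --automate mynewlist listowner@example.org "$LIST_PASSWORD"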
Change-Id: I550ea802929235d55750c4d99c7d9beec28260f0