This reverts commit 95d9b838140e44c9547ad1fa28bc88206823198c.
We've found that we run out of memory at 44g. Bump back up to 48g as
that should give us a bit more headroom.
Change-Id: I14a8f2b298aa1d3cb5c0829508ee137a6769675b
We had been setting this to 48GB on java 8, but recent gerrit service
issues indicate that this may be too large for our current system on
java 11. In particular it appears the non-heap portions of the jvm may
be in the ~8GB range, leaving only about 5-6GB of usable system memory
for other activities like web servers, backups, and garbage collection.
Reduce this to 44GB to increase headroom to see if that helps us. Java
11 is reported to be much more efficient at garbage collecting so
hopefully that makes up the difference between lower memory and where we
were on java 8. As a side note we could revert back to java 8 as another
option.
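For a rough sense of the arithmetic (the total RAM figure below is an
assumption inferred from the numbers above, not a measured value):

    # Approximate memory left for web servers, backups and garbage
    # collection once the JVM heap and non-heap usage are accounted for.
    total_ram_gb = 62      # assumed host size, not measured
    jvm_non_heap_gb = 8    # observed non-heap usage on java 11 (approx.)

    for heap_gb in (48, 44):
        headroom = total_ram_gb - heap_gb - jvm_non_heap_gb
        print(f"heap={heap_gb}GB -> ~{headroom}GB of headroom")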
Change-Id: Ie326aad2a9895098b484924a26c9257cd009d89e
The hound project has undergone a small re-birth and moved to
https://github.com/hound-search/hound
which has broken our deployment. We've talked about leaving
codesearch up to gitea, but it's not quite there yet. There seems to
be no point working on the puppet now.
This builds a container that runs houndd. It's an opendev specific
container; the config is pulled from project-config directly.
There are some custom scripts that drive things. Some points for
reviewers:
- update-hound-config.sh uses "create-hound-config" (which is in
jeepyb for historical reasons) to generate the config file. It
grabs the latest projects.yaml from project-config and exits with a
return code to indicate if things changed.
- when the container starts, it runs update-hound-config.sh to
populate the initial config. There is a testing environment flag
and small config so it doesn't have to clone the entire opendev for
functional testing.
- it runs under supervisord so we can restart the daemon when
projects are updated. Unlike earlier versions that didn't start
listening till indexing was done, this version now puts up a "Hound
is not ready yet" message when while it is working; so we can drop
all the magic we were doing to probe if hound is listening via
netstat and making Apache redirect to a status page.
- resync-hound.sh is run from an external cron job daily, and does
this update and restart check (see the sketch after this list).
Since it only reloads if changes are made, this should be relatively
rare anyway.
- There is a PR to monitor the config file
(https://github.com/hound-search/hound/pull/357) which would mean
the restart is unnecessary. This would be good in the near future
and we could remove the cron job.
- playbooks/roles/codesearch is unexciting and deploys the container,
certificates and an apache proxy back to localhost:6080 where hound
is listening.
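To make the update/restart flow concrete, a minimal Python sketch of
what resync-hound.sh does; the script path, the meaning of the exit
code and the supervisord program name are assumptions for illustration,
not taken from this change:

    # Regenerate the hound config from project-config and restart the
    # daemon only if the generated config actually changed.
    import subprocess

    result = subprocess.run(["/usr/local/bin/update-hound-config.sh"])
    if result.returncode != 0:
        # Assumed convention: non-zero exit means the config changed,
        # so restart houndd under supervisord to pick it up.
        subprocess.run(["supervisorctl", "restart", "hound"], check=True)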
I've combined removal of the old puppet bits here as the "-codesearch"
namespace was already being used.
Change-Id: I8c773b5ea6b87e8f7dfd8db2556626f7b2500473
These changes are squashed together to simplify applying them to config
management, and to avoid zuul and ansible running one of these without
the others. We essentially need them all in place at the same time to
accurately reflect the post-upgrade state.
We stop blocking /p/ in gerrit's apache vhost. /p/ is used for
dashboards.
We add a few java options that new gerrit sets by default.
We update the gerrit image in docker compose to 3.2.
We update zuul to use basic auth instead of digest auth when talking to
Gerrit.
Change-Id: I6ea38313544ce1ecbc4cfd914b1f33e77d0d2d03
Follow-on to Ia9579c7b3204b47d453fc51388265bf1867af20c, this also
matches the web-debug* log files
Change-Id: Ibabbfa3b01317528a75eeec17ea28168da57123a
This cuts out the bulk of the storage expense, but leaves us with the
regular logs for enhanced audit trails.
Change-Id: Ia9579c7b3204b47d453fc51388265bf1867af20c
This should help reduce the bulk of the review site backups
* launchpadlib cache has ~650,000 files which we don't need to track
* review_site/tmp has ~50,000 files
* review_site/cache is about 9gb
* review_site/index is optional to back up, but a) it's very unlikely
to be useful in a full restore situation; we'd have to re-create
them, and b) things seem to come and go under this directory during
the backup, causing it to exit with an error status.
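Purely as an illustration (the real change edits the server's backup
configuration; the tool invocation, paths and repository below are
assumptions), excluding these paths might look roughly like:

    # Hypothetical sketch: pass the paths above to borg as excludes.
    import subprocess

    excludes = [
        "/home/gerrit2/.launchpadlib",    # assumed launchpadlib cache path
        "/home/gerrit2/review_site/tmp",
        "/home/gerrit2/review_site/cache",
        "/home/gerrit2/review_site/index",
    ]

    cmd = ["borg", "create", "backup@backup-server:/backup::review-{now}",
           "/home/gerrit2"]
    for path in excludes:
        cmd += ["--exclude", path]
    subprocess.run(cmd, check=True)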
Change-Id: If7009cfcd5a3a07c07108149772cc8c1873bf277
This serverId value is used by notedb to identify the gerrit cluster
that notedb contents belong to. By default a random uuid is generated by
gerrit for this value. In order to avoid config management and gerrit
fighting over this value after we upgrade we set a value now.
This should be safe to land on 2.13 as old gerrit should ignore the
value.
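For reference, a minimal sketch of pinning the value; the config path
and the way the id is generated here are illustrative, the real change
sets it through config management:

    # gerrit.config is git-config format, so "git config -f" can set
    # gerrit.serverId to a fixed value that notedb will then reuse.
    import subprocess
    import uuid

    gerrit_config = "/home/gerrit2/review_site/etc/gerrit.config"  # assumed
    server_id = str(uuid.uuid4())  # in practice a fixed, recorded value

    subprocess.run(
        ["git", "config", "-f", gerrit_config, "gerrit.serverId", server_id],
        check=True,
    )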
Change-Id: I57c9b436a9d0d1dfe77eee907d50fc1dcda6ab12
bup is going crazy and filling the disk when making its backups. We
have moved this into the borg backup group and run some backups, so
rather than spending time debugging this, we are just going to disable
bup on the server.
Change-Id: I1daad4eb05f8222131dc84c12577dec924874466
Backups have been going well on ethercalc02, so add borg backup runs
to all backed-up servers. Port in some additional excludes for Zuul
and slightly modify the /var/ matching.
Change-Id: Ic3adfd162fa9bedd84402e3c25b5c1bebb21f3cb
As done for ORD, see Ic1e64a9f0de7bca2659404243d3a004b70888e89
Change-Id: I01a0d259abfed00745dd4cf5957ee3cfd14b9449
Depends-On: https://review.opendev.org/760493
We don't need a duplicate name; we need a mirror-int.ord.rax.opendev.org
name. I think this was a copy/paste failure. Simple fix.
Change-Id: Ibe079da6d9393d30e8a664cc67355336d27105e4
Logs show that the nameservers are being notified via ipv6 and
rejecting the request:
nsd[18851]: notify for acme.opendev.org. \
from 2001:4800:7819:104:be76:4eff:fe04:43d0 refused, no acl matches.
Modify the nsd ACL to allow the ipv6 of the master to trigger updates.
This is important for the letsencrypt process, where we need the
acme.opendev.org domain updated in a timely fashion so that TXT
authentication works.
Change-Id: I785f9636dd05e15b8ffd211845f439be7e8344a3
This should create a certificate that also covers the -int hostnames,
which are records that point to the RAX internal network, rather than
public network.
Change-Id: Ic1e64a9f0de7bca2659404243d3a004b70888e89
Depends-On: https://review.opendev.org/759970
We seem to be under a similar attack to last time. The new apache filter
in front of gitea was implemented to be used if this happened again.
Switch to it.
Change-Id: Ib9ed3029dad7fc26cca209fece547a2a94d8da4a
We previously disabled access to the local gerrit git mirrors at the
/p/ prefix, as newer gerrit uses that path for something else. The next
step is to stop replicating to that location entirely.
Another reason for this is that when we switch to notedb, this local
replication will replicate everything; if we then expose it, we'd
potentially expose content we don't want to via git (rather than the
gerrit APIs).
Change-Id: I795466af3e1608eefe506ca56828327491f73c27
To catch up -- because this work is moving slowly ... the two backup
servers are currently the vexxhost and RAX ORD hosts. The vexxhost
node is deployed with Ansible on Bionic, but the old ORD host still
needs to be upgraded and moved out of puppet. Instead of dealing with
the unmaintained bup and getting it to work on the current LTS Focal,
we are doing an initial borg deployment with plans to switch to it
globally.
This adds the backup02.ca-ymq-1.vexxhost.opendev.org to the inventory
and borg-backup-server group, so it will be deployed as a borg backup
server (note, no hosts are backing up to it, yet).
To avoid the original bup roles matching, we restrict the
backup-server group to backup01.ca-ymq-1.vexxhost.opendev.org
explicitly.
Change-Id: Id30a2ffad75236fc23ed51b2c67d0028da988de5
This was a host used to transition to running nodepool builders in
docker. That transition has been completed for nb01.opendev.org and
nb02.opendev.org and we don't need the third x86 builder.
Change-Id: I93c7fc9b24476527b451415e7c138cd17f3fdf9f
We keep port 2181 listening in zookeeper so that we can easily use the
zkshell tool to debug and navigate the database. But now that all zuul
and nodepool nodes are using tls we don't need to expose this insecure
port publicly.
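For example, the kind of local-only poking around the plaintext port is
kept for (kazoo is used here purely for illustration; the commit itself
refers to the zkshell tool):

    # Browse the zookeeper tree from the zookeeper host itself; once the
    # firewall change lands this port is no longer reachable externally.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="localhost:2181")
    zk.start()
    print(zk.get_children("/"))
    zk.stop()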
Change-Id: I2a5ab8a9aee8f2739953e859ea52e6e9fd440790
This should only land after we've launched a new nb03.opendev.org
running with the new nodepool arm64 docker image. Once that happens and
we are happy with how it is running we can safely stop managing the
existing nb03.openstack.org server with puppet.
Change-Id: I8d224f9775bd461b43a2631897babd9e351ab6ae
This server is going to be our new arm64 nodepool-builder running on the
new arm64 docker images for nodepool.
Depends-On: https://review.opendev.org/750037
Change-Id: I3b46ff901eb92c7f09b79c22441c3f80bc6f9d15
It turns out you can't use "run_once" with the "free" strategy in
Ansible. It actually warns you about this, if you're looking in the
right place.
The existing run-puppet role calls two things with "run_once:", both
delegated to localhost -- cloning the ansible-role-puppet repo (so we
can include_role: puppet) and installing the puppet modules (via
install-ansible-roles role), which are copied from bridge to the
remote side and run by ansible-role-puppet.
With remote_puppet_else.yaml we are running all the puppet hosts at
once with the "free" strategy. This means that these two tasks, both
delegated to localhost (bridge) are actually running for every host.
install-ansible-roles does a git clone, and thus we often see one of
the clones bailing out with a git locking error, because the other
host is running simultaneously.
I8585a1af2dcc294c0e61fc45d9febb044e42151d tried to stop this with
"run_once:" -- but as noted because it's running under the "free"
strategy this is silently ignored.
To get around this, split out the two copying steps into a new role
"puppet-setup". To maintain the namespace, the "run-puppet" module is
renamed to "puppet-run". Before each call of (now) "puppet-run", make
sure we run "puppet-setup" just on localhost.
Remove the run_once and delegation on "install-ansible-roles", because
this is now called from the playbook with localhost context.
Change-Id: I3b1cea5a25974f56ea9202e252af7b8420f4adc9
The zuul01.openstack.org server is not matching the Ansible backup
group, which specifies opendev.org. This means it is not backing up
to the "new" vexxhost server like everything else.
Change-Id: I07ac19f7cb5597950886c01806189e479e7a3724
The process of switching hosts to Ansible backups got a little
... backed up. I think the idea was that we would move these legacy
hosts to an all-Ansible configuration a little faster than what has
ended up happening.
In the mean time, we have done a better job of merging our environment
so puppet hosts are just a regular host that runs a puppet step rather
than separate entities.
So there is no problem running these roles on these older servers.
This will bring consistency to our backup story with everything being
managed from Ansible.
This will currently set up these hosts to back up to the only opendev
backup server in vexxhost. As a follow-on, we will add another
opendev backup host in another provider to provide dual-redundancy.
After that, we can remove the bup::site calls from these hosts and
retire the puppet-based backups.
Change-Id: Ieaea46d312056bf34992826d673356c56abfc87a
With I37dcce3a67477ad3b2c36f2fd3657af18bc25c40 we removed the
configuration management of backups on the zuul server, which was
happening via puppet. So the server continues in its last state, but
if we ever built a fresh server it would not have backups.
Add it into the Ansible backup group, and uncomment the backup-server
group to get a run and set up the Ansible-managed backups.
Change-Id: I0af6b7fedc2f8f5a7f214771918138f72d298325
The host is review-test.opendev.org, so hostvars for
review-test.openstack.org are not so much going to do anything.
It's easier if we just ssh as root from review to gerrit2
on review-test.
review-test needs to be in letsencrypt group and have a
handler.
We need to install mysql - it's on the existing review
servers but not in ansible; it's just left over from
puppet.
The db credentials are in /root/.gerrit_db.cnf
Change-Id: I90e3c9d1b398cc16fea9f7056cfb059c7140160e
I476674036748d284b9f51e30cc2ffc9650a50541 did not open port 3081 so
the proxy isn't visible. Also this group variable is a better place
to update the setting.
Change-Id: Iad0696221bb9a19852e4ce7cbe06b06ab360cf11
The OpenStack Infrastructure team has disbanded, replaced by the
OpenDev community and the OpenStack TaCT SIG. As OpenStack-specific
community infrastructure discussion now happens under TaCT's banner
and they use the openstack-discuss ML, redirect any future messages
for the openstack-infra ML there so we can close down the old list.
Change-Id: I0aea3b36668a92e47a6510880196589b94576cdf
This deploys graphite from the upstream container.
We override the statsd configuration to have it listen on ipv6.
Similarly we override the nginx config to listen on ipv6, enable ssl,
forward port 80 to 443, and block the /admin page (we don't use it).
For production we will just want to put some cinder storage in
/opt/graphite/storage on the production host and figure out how to
migrate the old stats. There is also a bit of cleanup that will follow,
because we half-converted grafana01.opendev.org -- so everything can't
be in the same group till that is gone.
Testing has been added to push some stats and ensure they are seen.
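A rough sketch of that kind of check (the hostname, metric name and
wait time are illustrative; the real test runs against the test node):

    # Push a counter over the statsd UDP protocol, then ask the graphite
    # render API whether the metric showed up.
    import socket
    import time
    import urllib.request

    host = "graphite.opendev.org"       # illustrative target
    metric = "test.functional.counter"

    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.sendto(f"{metric}:1|c".encode(), (host, 8125))

    time.sleep(60)   # give statsd/carbon time to flush and store it

    url = (f"https://{host}/render"
           f"?target=stats_counts.{metric}&format=json")
    with urllib.request.urlopen(url) as resp:
        assert metric in resp.read().decode()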
Change-Id: Ie843b3d90a72564ef90805f820c8abc61a71017d
This uses the Grafana container created with
Iddfafe852166fe95b3e433420e2e2a4a6380fc64 to run the
grafana.opendev.org service.
We retain the old model of an Apache reverse-proxy; it's well tested
and understood, it's much easier than trying to map all the SSL
termination/renewal/etc. into the Grafana container and we don't have
to convince ourselves the container is safe to be directly web-facing.
Otherwise this is a fairly straightforward deployment of the
container. As before, it uses the graph configuration kept in
project-config which is loaded in with grafyaml, which is included in
the container.
One nice advantage is that it makes it quite easy to develop graphs
locally, using the container which can talk to the public graphite
instance. The documentation has been updated with a reference on how
to do this.
Change-Id: I0cc76d29b6911aecfebc71e5fdfe7cf4fcd071a4