263 Commits

Clark Boylan
613810dba1 Revert "Reduce gerrit heap limit to 44g"
This reverts commit 95d9b838140e44c9547ad1fa28bc88206823198c.

We've found that we run out of memory at 44g. Bump back up to 48g as
that should give us a bit more headroom.
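
For reference, a minimal sketch of how the heap setting could be pinned from
Ansible (the path, task shape, and handler name here are assumptions for
illustration; system-config templates gerrit.config its own way):

  - name: Set the Gerrit JVM heap limit (illustrative sketch)
    community.general.ini_file:
      path: /home/gerrit2/review_site/etc/gerrit.config   # assumed gerrit site path
      section: container
      option: heapLimit
      value: 48g      # back up from 44g to leave headroom for non-heap JVM memory
    notify: restart gerrit      # hypothetical handler name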

Change-Id: I14a8f2b298aa1d3cb5c0829508ee137a6769675b
2020-12-09 15:26:43 -08:00
Clark Boylan
95d9b83814 Reduce gerrit heap limit to 44g
We had been setting this to 48GB on java 8, but recent gerrit service
issues indicate that this may be too large for our current system on
java 11. In particular it appears the non-heap portions of the jvm may
be in the ~8GB range, leaving only about 5-6GB of usable system memory
for other activities like web servers, backups, and garbage collection.

Reduce this to 44GB to increase headroom to see if that helps us. Java
11 is reported to be much more efficient at garbage collecting so
hopefully that makes up the difference between lower memory and where we
were on java 8. As a side note we could revert back to java 8 as another
option.

Change-Id: Ie326aad2a9895098b484924a26c9257cd009d89e
2020-12-08 07:31:53 -08:00
fungi.admin
2197f11a0f Merge "Omnibus Gerrit 3.2 changes" 2020-11-21 17:19:58 +00:00
Zuul
1b16dae681 Merge "Migrate codesearch site to container" 2020-11-19 22:26:12 +00:00
Ian Wienand
368466730c Migrate codesearch site to container
The hound project has undergone a small re-birth and moved to

 https://github.com/hound-search/hound

which has broken our deployment.  We've talked about leaving
codesearch up to gitea, but it's not quite there yet.  There seems to
be no point working on the puppet now.

This builds a container that runs houndd.  It's an opendev specific
container; the config is pulled from project-config directly.
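
To give a feel for how the container sits behind the proxy described below,
here is a hypothetical docker-compose fragment (the image name and config
path are assumptions, not the actual files):

  # docker-compose.yaml fragment (sketch)
  services:
    hound:
      image: docker.io/opendevorg/hound:latest    # assumed image name
      ports:
        - "127.0.0.1:6080:6080"                   # Apache proxies to localhost:6080
      volumes:
        - /etc/hound:/hound/config                # assumed config location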

There are some custom scripts that drive things.  Some points for
reviewers:

 - update-hound-config.sh uses "create-hound-config" (which is in
   jeepyb for historical reasons) to generate the config file.  It
   grabs the latest projects.yaml from project-config and exits with a
   return code to indicate if things changed.

 - when the container starts, it runs update-hound-config.sh to
   populate the initial config.  There is a testing environment flag
   and small config so it doesn't have to clone the entire opendev for
   functional testing.

 - it runs under supervisord so we can restart the daemon when
   projects are updated.  Unlike earlier versions that didn't start
   listening till indexing was done, this version now puts up a "Hound
   is not ready yet" message while it is working; so we can drop
   all the magic we were doing to probe if hound is listening via
   netstat and making Apache redirect to a status page.

 - resync-hound.sh is run from an external cron job daily, and does
   this update and restart check.  Since it only reloads if changes
   are made, this should be relatively rare anyway.

 - There is a PR to monitor the config file
   (https://github.com/hound-search/hound/pull/357) which would mean
   the restart is unnecessary.  This would be good in the near term and
   we could remove the cron job.

 - playbooks/roles/codesearch is unexciting and deploys the container,
   certificates and an apache proxy back to localhost:6080 where hound
   is listening.

I've combined removal of the old puppet bits here as the "-codesearch"
namespace was already being used.

Change-Id: I8c773b5ea6b87e8f7dfd8db2556626f7b2500473
2020-11-20 07:41:12 +11:00
Clark Boylan
57f9e54ad8 Omnibus Gerrit 3.2 changes
These changes are squashed together to simplify applying them to config
management, so that zuul and ansible don't apply one of them without the
others. We essentially need them all in place at the same time to
accurately reflect the post-upgrade state.

We stop blocking /p/ in gerrit's apache vhost. /p/ is used for
dashboards.

We add a few java options that new gerrit sets by default.

We update the gerrit image in docker compose to 3.2.

We update zuul to use basic auth instead of digest auth when talking to
Gerrit.
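
To make the image bump concrete, a docker-compose fragment along these
lines (the image name and tag shown are assumptions, not copied from the
real file):

  # docker-compose.yaml fragment (sketch)
  services:
    gerrit:
      image: docker.io/opendevorg/gerrit:3.2   # assumed image name; previously a 2.13-era tag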

Change-Id: I6ea38313544ce1ecbc4cfd914b1f33e77d0d2d03
2020-11-17 16:04:56 -08:00
Zuul
2c7591c318 Merge "Set gerrit.serverId in gerrit.config" 2020-11-17 21:22:53 +00:00
Ian Wienand
c16501af8a zuul backup : expand debug log match
Follow-on to Ia9579c7b3204b47d453fc51388265bf1867af20c, this also
matches the web-debug* log files.

Change-Id: Ibabbfa3b01317528a75eeec17ea28168da57123a
2020-11-13 14:34:06 +11:00
Ian Wienand
dbff6071b1 backup: skip zuul debug logs for backup
This cuts out the bulk of the storage expense, but leaves us with the
regular logs for enhanced audit trails.
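
A rough sketch of what the exclude could look like in the host's backup
variables (the variable name and log path are guesses at the borg role's
interface, not the real values):

  # host_vars for the zuul server (hypothetical)
  borg_backup_excludes_extra:
    - /var/log/zuul/debug*      # bulky debug logs
    # the regular logs stay in the backup for the audit trail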

Change-Id: Ia9579c7b3204b47d453fc51388265bf1867af20c
2020-11-12 12:11:39 +11:00
Ian Wienand
6bcfe05742 review: trim backups
This should help reduce the bulk of the review site backups:

 * launchpadlib cache has ~650,000 files which we don't need to track
 * review_site/tmp has ~50,000 files
 * review_site/cache is about 9gb
 * review_site/index is optional to back up, but a) it's very unlikely
   to be useful in a full restore situation; we'd have to re-create the
   indexes anyway, and b) things seem to come and go under this directory
   during the backup, causing it to exit with an error status.

Change-Id: If7009cfcd5a3a07c07108149772cc8c1873bf277
2020-11-11 23:36:11 +00:00
Clark Boylan
b9b1cba959 Set gerrit.serverId in gerrit.config
This serverId value is used by notedb to identify the gerrit cluster
that notedb contents belong to. By default a random uuid is generated by
gerrit for this value. In order to avoid config management and gerrit
fighting over this value after we upgrade we set a value now.
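
A hedged sketch of pinning the value (the variable name is invented and the
UUID below is a placeholder; system-config templates gerrit.config rather
than setting it like this):

  # group_vars for the review server (hypothetical variable name)
  gerrit_server_id: "00000000-0000-0000-0000-000000000000"   # placeholder uuid
  # the gerrit.config template would then emit:
  #   [gerrit]
  #     serverId = <that value>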

This should be safe to land on 2.13 as old gerrit should ignore the
value.

Change-Id: I57c9b436a9d0d1dfe77eee907d50fc1dcda6ab12
2020-11-10 10:30:58 -08:00
Ian Wienand
b05a98440a Remove etherpad from bup backup
bup is going crazy and filling the disk when making its backups.  We
have moved this into the borg backup group and run some backups, so
rather than spending time debugging this, we are just going to disable
bup on the server.

Change-Id: I1daad4eb05f8222131dc84c12577dec924874466
2020-11-10 13:52:03 +11:00
Zuul
9ff95a5f00 Merge "etherpad: ignore live db for borg backups" 2020-11-10 00:11:22 +00:00
Ian Wienand
b26622ad12 etherpad: ignore live db for borg backups
Change-Id: Ie7f7e189720e68ec0b07a727be0f5752da20566d
2020-11-10 10:11:24 +11:00
Ian Wienand
d533e89089 Add all backup hosts to borg backups
Backups have been going well on ethercalc02, so add borg backup runs
to all backed-up servers.  Port in some additional excludes for Zuul
and slightly modify the /var/ matching.

Change-Id: Ic3adfd162fa9bedd84402e3c25b5c1bebb21f3cb
2020-11-09 17:23:22 +11:00
Ian Wienand
3568b76c3c Add * match to grafana.opendev.org
This wasn't matching grafana01

Change-Id: I930a6d1428d8becd29d15fdb53d26b0c186b79fd
2020-11-05 11:35:57 +11:00
Zuul
1dc940c74f Merge "RAX DFW/IAD : add internal mirror DNS to cert" 2020-11-04 03:28:57 +00:00
Ian Wienand
676c5dad44 Add borg backup server in RAX ORD
This is our second backup server for borg, hosted in RAX/ORD.

Change-Id: I2c896345e497067ce12863bdb1dda8ce467e2243
2020-10-30 16:39:25 +11:00
Ian Wienand
9a0dfc3004 RAX DFW/IAD : add internal mirror DNS to cert
As done for ORD, see Ic1e64a9f0de7bca2659404243d3a004b70888e89

Change-Id: I01a0d259abfed00745dd4cf5957ee3cfd14b9449
Depends-On: https://review.opendev.org/760493
2020-10-30 15:02:51 +11:00
Ian Wienand
c49ece9204 Cleanup grafana.openstack.org
The opendev.org server is in production, cleanup the old puppet-based
host.

Change-Id: I6db3ce929226a23b96234b52ece8b17f4c6a326a
2020-10-29 07:59:42 +11:00
Clark Boylan
c38f27c4bc This updates LE config for the ord mirror to the correct name
We don't need a duplicate name; we need a mirror-int.ord.rax.opendev.org
name. I think this was a copy-paste failure. Simple fix.
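
For illustration, the kind of entry this implies in the host's letsencrypt
variables (the variable layout and every hostname other than the -int one
are assumptions):

  # host_vars for the ORD mirror (hypothetical layout)
  letsencrypt_certs:
    mirror01-ord-rax-opendev-org-main:
      - mirror01.ord.rax.opendev.org        # assumed primary name
      - mirror-int.ord.rax.opendev.org      # the internal-network name, not a duplicate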

Change-Id: Ibe079da6d9393d30e8a664cc67355336d27105e4
2020-10-28 09:59:17 -07:00
Ian Wienand
0746dc187b nameserver: Allow master server to notify via ipv6
Logs show that the nameservers are being notified via ipv6 and
rejecting the request:

  nsd[18851]: notify for acme.opendev.org. \
   from 2001:4800:7819:104:be76:4eff:fe04:43d0 refused, no acl matches.

Modify the nsd ACL to allow the ipv6 of the master to trigger updates.
This is important for the letsencrypt process, where we need the
acme.opendev.org domain updated in a timely fashion so that TXT
authentication works.
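
A hedged sketch of the intent, as it might appear in the nameserver
variables (the variable name is invented; the real role templates nsd.conf
itself):

  # group_vars for the nameservers (hypothetical)
  nsd_notify_allowed_from:
    # the existing IPv4 entry for the hidden master stays as-is
    - "2001:4800:7819:104:be76:4eff:fe04:43d0"   # master's IPv6 from the refused notify above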

Change-Id: I785f9636dd05e15b8ffd211845f439be7e8344a3
2020-10-28 09:26:14 +00:00
Ian Wienand
7e2641f672 Generate internal certs for RAX ORD mirror
This should create a certificate that also covers the -int hostnames,
which are records that point to the RAX internal network, rather than the
public network.

Change-Id: Ic1e64a9f0de7bca2659404243d3a004b70888e89
Depends-On: https://review.opendev.org/759970
2020-10-28 10:04:45 +11:00
Clark Boylan
ccc91ba9c9 Switch to filtered apache backends
We seem to be under a similar attack to last time. The new apache filter
in front of gitea was implemented to be used if this happened again.
Switch to it.

Change-Id: Ib9ed3029dad7fc26cca209fece547a2a94d8da4a
2020-10-14 08:55:21 -07:00
Zuul
3796649b66 Merge "borg backup : add ethercalc02" 2020-10-12 22:22:53 +00:00
Ian Wienand
27f67136f7 borg backup : add ethercalc02
Add ethercalc as the first borg backup client.  We will monitor this
as it rolls into production.

Change-Id: I1ac71e92a8e5c779cd98af16ee4958877c6162ce
2020-10-12 16:30:46 +11:00
Clark Boylan
cae58690dc Stop replicating to local git mirror on gerrit
We previously disabled access to the local gerrit git mirrors at the /p/
prefix, as newer gerrit uses that path for something else. The next
step is to stop replicating to that location entirely.

Another reason for this is that when we switch to notedb this local
replication will replicate everything; if we then expose it, we'd
potentially expose content we don't want to via git (rather than via
the gerrit APIs).

Change-Id: I795466af3e1608eefe506ca56828327491f73c27
2020-10-09 10:39:30 -07:00
Ian Wienand
0d0f8ffe14 Add initial borg backup server
To catch up -- because this work is moving slowly ... the two backup
servers are currently the vexxhost and RAX ORD hosts.  The vexxhost
node is deployed with Ansible on Bionic, but the old ORD host still
needs to be upgraded and moved out of puppet.  Instead of dealing with
the unmaintained bup and getting it to work on the current LTS Focal,
we are doing an initial borg deployment with plans to switch to it
globally.

This adds the backup02.ca-ymq-1.vexxhost.opendev.org to the inventory
and borg-backup-server group, so it will be deployed as a borg backup
server (note, no hosts are backing up to it, yet).

To avoid the original bup roles matching, we restrict the
backup-server group to backup01.ca-ymq-1.vexxhost.opendev.org
explicitly.

Change-Id: Id30a2ffad75236fc23ed51b2c67d0028da988de5
2020-10-08 11:54:27 +11:00
Ian Wienand
1b4006757a Cleanup graphite01
Server is replaced with graphite02.opendev.org

Change-Id: Ie6099e935a6a7e10c818d1d3003e44bca11dd13a
2020-09-30 11:55:24 +10:00
Clark Boylan
9fdbd56d16 Remove nb04
This was a host used to transition to nodepool builders run in docker. That
transition has been completed for nb01.opendev.org and nb02.opendev.org
and we don't need the third x86 builder.

Change-Id: I93c7fc9b24476527b451415e7c138cd17f3fdf9f
2020-09-18 11:12:04 -07:00
Zuul
98370830a3 Merge "Remove mirror01.regionone.linaro-us.opendev.org" 2020-09-18 04:46:09 +00:00
Clark Boylan
1bff2f9fca Block port 2181 on zookeeper hosts
We keep port 2181 listening in zookeeper so that we can easily use the
zkshell tool to debug and navigate the database. But now that all zuul
and nodepool nodes are using tls we don't need to expose this insecure
port publicly.
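
Roughly, the firewall change amounts to something like this (the variable
name is a guess at the iptables role interface, and the TLS port number is
assumed):

  # group_vars for the zookeeper cluster (sketch)
  iptables_extra_public_tcp_ports:
    - 2281    # TLS client port used by zuul and nodepool (assumed)
    # 2181 is no longer listed, so the plaintext port stays local-only for zkshell debugging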

Change-Id: I2a5ab8a9aee8f2739953e859ea52e6e9fd440790
2020-09-09 15:31:47 -07:00
Clark Boylan
1ea83138ef Remove nodepool builder puppetry and nb03.openstack.org
This should only land after we've launched a new nb03.opendev.org
running with the new nodepool arm64 docker image. Once that happens and
we are happy with how it is running we can safely stop managing the
existing nb03.openstack.org server with puppet.

Change-Id: I8d224f9775bd461b43a2631897babd9e351ab6ae
2020-09-09 15:09:43 -07:00
Clark Boylan
ebd9c4c59e Add nb03.opendev.org
This server is going to be our new arm64 nodepool-builder running on the
new arm64 docker images for nodepool.

Depends-On: https://review.opendev.org/750037
Change-Id: I3b46ff901eb92c7f09b79c22441c3f80bc6f9d15
2020-09-04 13:22:32 -07:00
Ian Wienand
19ea4603f4 puppet: don't run module install steps multiple times
It turns out you can't use "run_once" with the "free" strategy in
Ansible.  It actually warns you about this, if you're looking in the
right place.

The existing run-puppet role calls two things with "run_once:", both
delegated to localhost -- cloning the ansible-role-puppet repo (so we
can include_role: puppet) and installing the puppet modules (via
install-ansible-roles role), which are copied from bridge to the
remote side and run by ansible-role-puppet.

With remote_puppet_else.yaml we are running all the puppet hosts at
once with the "free" strategy.  This means that these two tasks, both
delegated to localhost (bridge), are actually running for every host.
install-ansible-roles does a git clone, and thus we often see one of
the clones bailing out with a git locking error, because the other
host is running simultaneously.
I8585a1af2dcc294c0e61fc45d9febb044e42151d tried to stop this with
"run_once:" -- but as noted because it's running under the "free"
strategy this is silently ignored.

To get around this, split out the two copying steps into a new role
"puppet-setup".  To maintain the namespace, the "run-puppet" module is
renamed to "puppet-run".  Before each call of (now) "puppet-run", make
sure we run "puppet-setup" just on localhost.
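
A minimal sketch of the resulting playbook shape (role names follow the
description above; the host patterns and exact layout are illustrative):

  # remote_puppet_else.yaml shape (sketch)
  - hosts: localhost
    roles:
      - puppet-setup     # clone ansible-role-puppet and install puppet modules, once

  - hosts: puppet
    strategy: free       # run_once is silently ignored under this strategy
    roles:
      - puppet-run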

Remove the run_once and delegation on "install-ansible-roles", because
this is now called from the playbook with localhost context.

Change-Id: I3b1cea5a25974f56ea9202e252af7b8420f4adc9
2020-09-03 09:23:05 +10:00
Ian Wienand
c55a548e71 mirror02.regionone.linaro.us : add missing LE file
Change-Id: Ia052f85a92b8a52d7e1896c24ac54dd9eb1620e0
2020-08-25 16:39:11 +10:00
Ian Wienand
600c9e78d4 Remove mirror01.regionone.linaro-us.opendev.org
Replaced with 02 mirror

Change-Id: I63114be35836f5ddb204e8c0ca5a1e10b056a4b0
2020-08-25 14:43:07 +10:00
Zuul
959473141b Merge "Backup inventory - match zuul01.openstack.org" 2020-08-04 01:48:27 +00:00
Zuul
b4c95d08b9 Merge "Update host_vars and sync-to-review-test playbook" 2020-07-28 14:09:11 +00:00
Ian Wienand
78eadcb783 Backup inventory - match zuul01.openstack.org
The zuul01.openstack.org server is not matching the Ansible backup
group, which specifies opendev.org.  This means it is not backing up
to the "new" vexxhost server like everything else.

Change-Id: I07ac19f7cb5597950886c01806189e479e7a3724
2020-07-28 13:06:05 +10:00
Zuul
1800b01bad Merge "Forward openstack-infra ML to openstack-discuss" 2020-07-15 15:22:14 +00:00
Ian Wienand
cacdb7f573 Backup all hosts with Ansible
The process of switching hosts to Ansible backups got a little
... backed up.  I think the idea was that we would move these legacy
hosts to an all-Ansible configuration a little faster than what has
ended up happening.

In the meantime, we have done a better job of merging our environment
so puppet hosts are just a regular host that runs a puppet step rather
than separate entities.

So there is no problem running these roles on these older servers.
This will bring consistency to our backup story with everything being
managed from Ansible.

This will currently set up these hosts to back up to the only opendev
backup server in vexxhost.  As a follow-on, we will add another
opendev backup host in another provider to provide dual-redundancy.
After that, we can remove the bup::site calls from these hosts and
retire the puppet-based backups.

Change-Id: Ieaea46d312056bf34992826d673356c56abfc87a
2020-07-15 08:33:44 +10:00
Ian Wienand
999a409530 Add Zuul to backups group
With I37dcce3a67477ad3b2c36f2fd3657af18bc25c40 we removed the
configuration management of backups on the zuul server, which was
happening via puppet.  So the server continues in its last state, but
if we ever built a fresh server it would not have backups.

Add it into the Ansible backup group, and uncomment the backup-server
group to get a run and set up the Ansible-managed backups.

Change-Id: I0af6b7fedc2f8f5a7f214771918138f72d298325
2020-07-14 08:35:57 +10:00
Monty Taylor
4aa28fee13 Update host_vars and sync-to-review-test playbook
The host is review-test.opendev.org, so hostvars for
review-test.openstack.org are not so much going to do anything.

It's easier if we just ssh as root from review to gerrit2
on review-test.

review-test needs to be in letsencrypt group and have a
handler.

We need to install mysql - it's on the existing review
servers but not in ansible; it's just left over from
puppet.

The db credentials are in /root/.gerrit_db.cnf

Change-Id: I90e3c9d1b398cc16fea9f7056cfb059c7140160e
2020-07-12 08:09:46 -05:00
Ian Wienand
5f2e6c43a8 gitea: open port 3081
I476674036748d284b9f51e30cc2ffc9650a50541 did not open port 3081 so
the proxy isn't visible.  Also this group variable is a better place
to update the setting.

Change-Id: Iad0696221bb9a19852e4ce7cbe06b06ab360cf11
2020-07-08 13:54:44 +10:00
Zuul
c72451c466 Merge "Don't install the track-upstream cron on review-test" 2020-07-08 00:26:25 +00:00
Monty Taylor
4d26d9cb40 Don't install the track-upstream cron on review-test
This is just spawning containers that never die.

Change-Id: I1f5215c6e60ac59d1eb224bef9032785938dfc70
2020-07-07 14:40:24 -05:00
Jeremy Stanley
c351382293 Forward openstack-infra ML to openstack-discuss
The OpenStack Infrastructure team has disbanded, replaced by the
OpenDev community and the OpenStack TaCT SIG. As OpenStack-specific
community infrastructure discussion now happens under TaCT's banner
and they use the openstack-discuss ML, redirect any future messages
for the openstack-infra ML there so we can close down the old list.

Change-Id: I0aea3b36668a92e47a6510880196589b94576cdf
2020-07-02 21:27:31 +00:00
Ian Wienand
185797a0e5 Graphite container deployment
This deploys graphite from the upstream container.

We override the statsd configuration to have it listen on ipv6.
Similarly we override the nginx config to listen on ipv6, enable ssl,
forward port 80 to 443, and block the /admin page (we don't use it).

For production we will just want to put some cinder storage in
/opt/graphite/storage on the production host and figure out how to
migrate the old stats.  There is also a bit of cleanup that will follow,
because we half-converted grafana01.opendev.org -- so everything can't
be in the same group till that is gone.
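
As a rough sketch of the deployment (the image name and mount paths are
assumptions):

  # docker-compose.yaml fragment (sketch)
  services:
    graphite:
      image: docker.io/graphiteapp/graphite-statsd:latest   # assumed upstream image
      network_mode: host   # so the overridden statsd/nginx configs can bind ipv6 on the host
      volumes:
        - /opt/graphite/storage:/opt/graphite/storage   # whisper data, on cinder in production
        # statsd and nginx config overrides are bind-mounted over the image defaults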

Testing has been added to push some stats and ensure they are seen.

Change-Id: Ie843b3d90a72564ef90805f820c8abc61a71017d
2020-07-03 07:17:28 +10:00
Ian Wienand
b146181174 Grafana container deployment
This uses the Grafana container created with
Iddfafe852166fe95b3e433420e2e2a4a6380fc64 to run the
grafana.opendev.org service.

We retain the old model of an Apache reverse-proxy; it's well tested
and understood, and it's much easier than trying to map all the SSL
termination/renewal/etc. into the Grafana container and we don't have
to convince ourselves the container is safe to be directly web-facing.

Otherwise this is a fairly straightforward deployment of the
container.  As before, it uses the graph configuration kept in
project-config, which is loaded in with grafyaml (included in the
container).

One nice advantage is that it makes it quite easy to develop graphs
locally, using the container which can talk to the public graphite
instance.  The documentation has been updated with a reference on how
to do this.
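
A hedged example of that local workflow (the image name and the datasource
environment variable are assumptions, not the documented interface):

  # docker-compose.yaml for local graph development (sketch)
  services:
    grafana:
      image: docker.io/opendevorg/grafana:latest   # assumed opendev-built image
      ports:
        - "127.0.0.1:3000:3000"
      environment:
        # hypothetical knob: point the default datasource at the public graphite
        GRAFANA_DATASOURCE_URL: https://graphite.opendev.org

With something like that running, grafyaml can then load dashboard
definitions from a local project-config checkout against localhost:3000.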

Change-Id: I0cc76d29b6911aecfebc71e5fdfe7cf4fcd071a4
2020-07-03 07:17:22 +10:00