Update the backup instructions for some recent changes. Make a note
of the streaming backup method, discuss some caveats with append-only
mode, and discuss the pruning scripts and when to run them
(cf. I9559bb8aeeef06b95fb9e172a2c5bfb5be5b480e,
I250d84c4a9f707e63fef6f70cfdcc1fb7807d3a7).
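As a reminder, a server-side prune looks something like the following
(repository path and retention values are placeholders; see the
pruning scripts for the real ones):

  borg prune --verbose --keep-daily 7 --keep-weekly 4 \
      --keep-monthly 6 /opt/backups/borg-<host>/backup

Note the append-only caveat: while the repository is served
append-only, space freed by a prune is not actually reclaimed.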
Change-Id: Idb04ebfa5666cd3c20bc0132683d187e705da3f1
Add the FUSE dependencies for our hosts backed up with borg, along
with a small script to make mounting the backups easier. This is the
best way to recover something quickly in what is sure to be a
stressful situation.
Documentation and testing are updated.
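For illustration, the recovery workflow with the FUSE support is
roughly (paths here are placeholders):

  borg mount /opt/backups/borg-<host>/backup /mnt/backup
  # browse /mnt/backup and copy out what you need, then
  borg umount /mnt/backup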
Change-Id: I1f409b2df952281deedff2ff8f09e3132a2aff08
Our Gerrit admins follow this model of access management now, in
order to shield the Administrators permission from risks associated
with external identity providers.
Change-Id: I3070c28c26548d364da38d366bfa2ac8b2fb4668
This adds roles to implement backup with borg [1].
Our current tool "bup" has no Python 3 support and is not packaged for
Ubuntu Focal. This means it is effectively end-of-life. borg fits
our model of servers backing themselves up to a central location, is
well documented and seems well supported. It also has the clarkb seal
of approval :)
As mentioned, borg works in the same manner as bup by doing an
efficient backup over ssh to a remote server. The core of these
roles is the same as the bup-based ones in terms of creating a
separate user for each host and deploying keys and ssh config.
This chooses to install borg in a virtualenv under /opt. This was
chosen for a number of reasons: firstly, reading borg's history there
have been incompatible updates (although they provide a tool to update
repository formats), so it seems important that we both pin the version
we are using and keep clients and server in sync. Since we have a
heterogeneous collection of distributions, we don't want to rely on
the packaged tools, which may differ. I don't feel like this is a
great application for a container; we actually don't want it that
isolated from the base system, because its goal is to read the system
and copy it offsite with as little chance of things going wrong as
possible.
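For illustration, the install amounts to something like the following
(the pinned version here is a placeholder; the roles control the real
one):

  python3 -m venv /opt/borg
  /opt/borg/bin/pip install 'borgbackup==1.1.13'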
Borg has a lot of support for encrypting the data at rest in various
ways. However, that introduces the possibility we could lose both the
key and the backup data. Really the only thing stopping this is key
management, and if we want to go down this path we can do it as a
follow-on.
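For reference, this means repositories are initialised along the
lines of (path hypothetical):

  borg init --encryption=none /opt/backups/borg-<host>/backup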
The remote server is configured via ssh command rules to run borg in
append-only mode. This means a misbehaving client can't delete its
old backups. In theory we can prune backups on the server side --
something we could not do with bup. The documentation has been
updated but is vague on this part; I think we should get some hosts in
operation, see how the de-duplication is working out and then decide
how we want to manage things long term.
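A sketch of the server-side forced command in authorized_keys (key
and paths shortened/hypothetical):

  command="borg serve --append-only --restrict-to-path /opt/backups/borg-<host>",restrict ssh-ed25519 AAAA... borg-<host>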
Testing is added; a Focal and a Bionic host both run a full backup of
themselves to the backup server. Pretty cool -- the logs are in
/var/log/borg-backup-<host>.log.
No hosts are currently in the borg groups, so this can be applied
without affecting production. I'd suggest the next steps are to bring
up a borg-based backup server and put a few hosts into this. After
running for a while, we can add all hosts, and then deprecate the
current bup-based backup server in vexxhost and replace that with a
borg-based one, giving us dual offsite backups.
[1] https://borgbackup.readthedocs.io/en/stable/
Change-Id: I2a125f2fac11d8e3a3279eb7fa7adb33a3acaa4e
We've got a section on using the emergency file and disabled ansible
group. Add info about the special DISABLE-ANSIBLE file there to help
make that info easier to find.
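For the record, disabling looks something like this (the path is
assumed from the current bridge setup):

  touch /home/zuul/DISABLE-ANSIBLE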
Change-Id: I2e750b9b87ca7a4f800d3ac161a195d49543a7da
Make inventory/service for service-specific things, including the
groups.yaml group definitions, and inventory/base for hostvars
related to the base system, including the list of hosts.
Move the existing host_vars into inventory/service, since most of
them are likely service-specific. Move group_vars/all.yaml into
base/group_vars, as almost all of it is related to base things,
with the exception of the gerrit public key.
A followup patch will move host-specific values into equivalent
files in inventory/base.
This should let us override hostvars in gate jobs. It should also
allow us to do better file matchers - and to be able to organize
our playbooks more if we want to.
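The resulting layout looks roughly like:

  inventory/
    base/
      hosts.yaml
      group_vars/
        all.yaml
    service/
      groups.yaml
      host_vars/
      group_vars/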
Depends-On: https://review.opendev.org/731583
Change-Id: Iddf57b5be47c2e9de16b83a1bc83bee25db995cf
We've got some old, out-of-date docs in some places. This isn't even
a full reworking, but it at least tries to remove some of the more
egregiously wrong things.
Change-Id: I9033acb9572e1ce1b3e4426564b92706a4385dcb
We had the clouds split from back when we used the openstack
dynamic inventory plugin. We don't use that anymore, so we don't
need these to be split. Any other usage we have directly references
a cloud.
Change-Id: I5d95bf910fb8e2cbca64f92c6ad4acd3aaeed1a3
With the move from OpenStack governance to our own OpenDev team, we
should also move to use the #opendev IRC channel in preference to
the #openstack-infra channel which will remain in use for OpenStack
specific discussions.
Update the references in our docs accordingly.
Change-Id: I448704f5d2664fd233a69a2ad12578ca24d9878a
This introduces two new roles for managing the backup-server and hosts
that we wish to back up.
Firstly the "backup" role runs on hosts we wish to backup. This
generates and configures a separate ssh key for running bup and
installs the appropriate cron job to run the backup daily.
The "backup-server" job runs on the backup server (or, indeed
servers). It creates users for each backup host, accepts the remote
keys mentioned above and initalises bup. It is then ready to receive
backups from the remote hosts.
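For illustration, the moving pieces are roughly as follows (host and
server names, plus the backed-up paths, are placeholders):

  # on the host being backed up, from cron:
  bup index /etc /home /root
  bup save -r bup-<host>@backup01.example.com: -n root /etc /home /root

  # on the backup server, once per host:
  useradd -m bup-<host>
  sudo -u bup-<host> bup init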
This eliminates a fairly long-standing requirement for manual setup of
the backup server users and keys; this section is removed from the
documentation.
testinfra coverage is added.
Change-Id: I9bf74df351e056791ed817180436617048224d2c
The launch script is referring to the wrong path for the emergency
inventory. Also correct the references in the sysadmin guide and
update the example for using it.
Change-Id: I80bdbd440ec451bcd6fb1a3eb552ffda32407c44
Reorder some of the commands used to set up and configure the bup
user on backup servers so the process is more straightforward and
requires fewer mental context switches.
Change-Id: I73cb80a04b8b5a74bb0857b4c8b6fb09030d6306
In sphinx, we have a :cgit_file: role that makes links to files.
Thing is - we're not using cgit anymore, so just rename it to
:git_file:.
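Usage is otherwise unchanged, e.g. (file path illustrative):

  :git_file:`doc/source/sysadmin.rst`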
Change-Id: I80aca5fb3cc84281e29843944fea33e6f4d9fe6f
The zuul and zuulv3 docs need to be merged, but that seemed like
too much for this change. Also, the 3rd-party CI doc is out of date,
but this patch only removes sections that linked to docs or files
that don't exist anymore.
Change-Id: Ie5497edd762d2146165608f3227b0bac88a913df
This change describes the shared github administrator account.
This is inspired by I0c61f192a6b5164af7babde5c99e5ee2b77a652c. As
described there, this allows for admins to have private accounts in
the organisation, but requires that 2FA be turned on. If people wish
to keep this as a single account with which they do "real" work
(commits, etc.) that is probably OK, but a note is added that you'll
end up with a lot of mostly irrelevant stuff in your feeds.
Change-Id: Ic408250571133796b4b4639715fe8d01f91898f2
Add some details about how we integrate a new cloud into the
ecosystem. I feel like this is an appropriate level of detail given
we're dealing with clueful admins who just need a rough guide on what
to do and can fill in the gaps.
Fix up the formatting a bit while we're here.
Change-Id: Iba3440e67ab798d5018b9dffb835601bb5c0c6c7
Fix the indents of some pages; the wrong indent led to gray bars
beside them.
Also, fix a typo and add some markup.
Change-Id: I6e7126ef7b782b376efcc7c6d69c6de9a504ddb5
We have a bunch of this handled now in ansible, so remove the old stuff.
Remove puppetmaster group management files. It's confusing for there to
be two files. Remove the old one.
Remove mqtt config. This isn't really a thing currently, and we're
eyeing running things from zuul anyway, so no need to port to ansible.
Change-Id: I8b64d21eadcc4a08bd5e5440fc5f756ae5bcd46b
Now that we've got base server stuff rewritten in ansible, remove the
old puppet versions.
Depends-On: https://review.openstack.org/588326
Change-Id: I5c82fe6fd25b9ddaa77747db377ffa7e8bf23c7b
This modernises the openstack-infra documentation by switching to
openstackdocstheme. Update dependencies as required.
To remove non-relevant stuff from conf.py, I have just taken the demo
file from openstackdocstheme and lightly modified it.
It seems later sphinx has included its own ":file:" role, which now
conflicts. Change ours to ":cgit_file:" in our documentation. Remove
the custom header template which no longer applies. Add the
post-2.0-pbr sphinx-based warning-as-error, which fixes the original
problem I actually noticed: errors could slip through the
gate tests :)
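For illustration, the relevant configuration ends up looking
something like this trimmed sketch:

  # conf.py
  extensions = ['openstackdocstheme']
  html_theme = 'openstackdocs'

  # setup.cfg
  [build_sphinx]
  warning-is-error = 1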
Change-Id: Ic7bec57b971bb4c75fc839e7269d1f69a576b85c
With the switch to Zuul v3, we need to resolve some configuration
catch-22s where project renames and related in-repository job
definitions can't happen without a complex multi-stage removal and
reintroduction process to get them through speculative testing
successfully. For now, just punt and use monolithic changes
bypassing CI in code review. As an up side, the Ansible automation
of this process coupled with Zuul v3's increased resilience to
on-the-fly configuration changes means we can skip stopping/starting
it now and significantly simplify the process.
Since we're here, correct the section heading level for
"Force-Merging a Change" in the sysadmin document.
Change-Id: I335c23abd0b5706f43bbea2dd8cfffa4280dd5db
Migrate backups to new backup01.ord.rax.ci.openstack.org
We decided to start fresh backups on the new server, so this is ready
to go. I have performed an initial backup on each server, so each has
accepted the host key of the new server and been tested (I also fixed
up review-dev.o.o, which was rebuilt but its keys not updated ... todo:
add this to puppet, but since it changes so infrequently it's not high
priority).
Change-Id: I0872f9fcf4a334d32f632b3cb04801deefab4fd1
We usually want to do these steps to avoid volume outages when
rackspace is doing updates.
Change-Id: Ie5de97484dddb9136c240baf46724646e39df67e
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
This adds the now-required bup init command to the server to be backed
up. Also remove the now-gone HPCloud backup server and fix the quotes
around the command for catting the public ssh key.
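For reference, the added step on the host to be backed up is along
the lines of (server name is a placeholder for the current backup
server):

  bup init -r bup-$(hostname)@<backup-server>: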
Change-Id: I607a7c079b16d7f1e94d6b0888cd6e302a04f68f
As discussed during the "Launch Node, Ansible and Puppet" summit
session in Austin, we're making things unnecessarily hard on
ourselves by insisting on having multiple servers in our inventory
with the same name. In order to make server addition and replacement
automation simpler, start using an ordinal suffix on server short
names to differentiate them (we can still easily rely on DNS for
their non-numbered convenience names).
Change-Id: I040a5c3b5e1abc50c3e4676bcab0bf4eaa550f4b
Sometimes we want to extend a logical volume to the entire size of the
volume group. The command to do this is quite strange and I am tired
of googling it, so it is now documented.
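For reference, the strange command in question (device names
illustrative; the filesystem resize step assumes ext4):

  lvextend -l +100%FREE /dev/main/main
  resize2fs /dev/main/main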
Change-Id: I600ceb41c57e27eaaf68a1643be848cd331130a5
We already have a dynamic system for managing static groups.
Use it for the disabled group so that the rules for managing the
members are not different.
Also, update the disabled list to match reality.
Also, update the docs because hosts are no longer groups.
The upstream OpenStack Inventory in Ansible was fixed to no longer
return each cloud host as its own group unless there are duplicates for
the host in question. This means it's no longer right to put hosts
into disabled:children; plain disabled is just fine.
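In plain YAML-inventory terms, the difference is roughly (hostname
illustrative):

  # before: each host was its own group
  disabled:
    children:
      somehost.openstack.org:

  # after: hosts go straight into the group
  disabled:
    hosts:
      somehost.openstack.org: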
Change-Id: I95c83ed64801db15ad99a14547895f3520356f99