Open Infrastructure Technical Overview
The OpenDev system administration team strives to run the services behind the OpenDev Collaboratory as an open source project; we term this open infrastructure.
Our infrastructure is code and contributions to it are handled just like the rest of OpenDev. This means that anyone can contribute to the installation and long-running maintenance of systems without shell access, and anyone who is interested can provide feedback and collaborate on code reviews. There are no permissions or special privileges required to contribute to the OpenDev infrastructure project.
Below is a short guide to the major pieces of the project. Some knowledge of Zuul job configuration, Ansible, the Gerrit code-review system and general Linux administration is assumed; however, expertise is not required.
Operating environment
The OpenDev production systems run in resources (compute, network, storage) provided by donations from companies who support the project.
Our standard production system is based on the latest Ubuntu LTS release.
Production systems are deployed by Ansible. Most production applications run from containers; some are custom built and others we use unmodified from upstream sources.
Zuul handles the testing and deployment of all changes. Current trends would refer to this as a gitops model -- all production changes are ultimately driven by a change proposed to the code-review system. This means we do not have bespoke production systems and any modifications we make are reviewed by peers and logged with change history.
We have a bastion host, or bridge, which is a static host with permissions to deploy to the production systems. Zuul will run Ansible on the production systems via this host to deploy new changes into production.
Getting started - CI
The configuration of every system operated by the OpenDev sysadmins is managed by Ansible and driven by continuous integration and deployment by Zuul. This is almost exclusively driven by code kept in the ``system-config`` repository, which can be browsed at:
All system configuration should be encoded in that repository so that anyone may propose a change in the running configuration to Gerrit.
Any change to the OpenDev infrastructure system is first proposed as a review to this repository at review.opendev.org. The current open reviews can be seen at
Zuul will first run CI on all incoming changes. Each service generally has its own CI job that runs when relevant files (configuration, Ansible roles, playbooks, etc.) are updated. These are generally called ``system-config-run-<service>``; Zuul will post a comment when the change has been tested, or you can see in-flight testing at the status page.
These jobs are crafted to replicate production as closely as possible. Reading the job definitions in :git_file:`zuul.d/system-config-run.yaml` will give you a feel for the hosts that are set up with each job. When you view the job results in the Zuul UI, you will see many logs collected from a number of hosts that simulate the production environment. These logs have all the information you generally need to debug problems, but the best place to start is the artifacts tab, which has some curated links to useful overviews.
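As a rough illustration of the shape such a job definition can take, the sketch below shows a hypothetical ``system-config-run-example`` job. The job name, node names, node label, group and file paths are invented placeholders, not entries copied from :git_file:`zuul.d/system-config-run.yaml`, and the parent job and the ``run_playbooks`` variable are assumptions. Only the general Zuul job attributes (``nodeset``, ``groups``, ``vars``, ``files``) are standard.

```yaml
# Hypothetical sketch only: names, labels and paths are placeholders.
- job:
    name: system-config-run-example
    parent: system-config-run          # assumed base job name
    description: Deploy and test a hypothetical "example" service.
    nodeset:
      nodes:
        # Ephemeral stand-in for the bastion host.
        - name: bridge.example.test
          label: ubuntu-lts            # placeholder node label
        # Ephemeral stand-in for a production server.
        - name: example01.example.test
          label: ubuntu-lts
      groups:
        # Group name matches the inventory group the playbook targets.
        - name: example
          nodes:
            - example01.example.test
    vars:
      # Assumed variable listing the playbooks the nested Ansible runs.
      run_playbooks:
        - playbooks/service-example.yaml
    # The job only runs when a changed file matches one of these patterns.
    files:
      - playbooks/service-example.yaml
      - playbooks/roles/example/
      - testinfra/test_example.py
```

Reading a definition of this kind from top to bottom tells you which test hosts will exist, which group each host belongs to, and which file changes cause the job to run.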
One of the job artifacts is the ARA report. This is a graphical view of the nested Ansible run on the (ephemeral) bastion host against the (ephemeral) production-test nodes. This is generally the first stop for finding deployment issues.
Another artifact is the testinfra results. Testinfra allows us to define unit-test-like behaviour to test functionality such as service and API status, correct deployment of users and files and other interesting details. Failures here indicate that the deployment steps worked, but some part of the operation of that system is not as we expect. The testinfra code driving this is kept in :git_file:`testinfra` and test files are named for the service they test.
Finally, there is a screenshots artifact, which is a link to a directory that some tests populate with image files. Tests that bring up interactive services use a headless browser to take screenshots of important pages to verify correct operation.
The logs tab has links to the raw logs; this collects much more detail, such as syslog, Apache logs, database dumps, etc. Once you have identified the general problem from the above steps, these logs provide the in-depth details for further analysis.
Playbooks and roles
The starting point for all services is generally the playbooks and roles kept in :git_file:`playbooks/`. Most playbooks are named ``service-<name>.yaml`` and indicate from their naming which production areas they drive.
During testing, these same playbooks are run against the test nodes. You can note that the testing hosts are given names that match the group configuration in the jobs defined in :git_file:`zuul.d/system-config-run.yaml`.
These playbooks are usually small and they call out to roles where most of the work is done. Roles are kept in :git_file:`playbooks/roles/`. These roles are written to be as generic as possible, but they are not expected to be used outside the OpenDev production deployment system.
These playbooks and roles are the same for CI and deployment.
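The split between a small playbook and a role that does the work can be sketched as follows. This is a minimal, hypothetical example: the file name, the ``example`` group and the ``example`` role are placeholders and do not refer to real files in the repository.

```yaml
# playbooks/service-example.yaml (hypothetical file name)
- hosts: example            # placeholder inventory group
  name: Configure the example service
  roles:
    # The role under playbooks/roles/ does the real work.
    - example
```

Because the playbook only names a group and a role, the same file can run unchanged against ephemeral test nodes in CI and against production hosts.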
Hosts and variables
The playbooks above run on groups of hosts which are defined in :git_file:`inventory/service/groups.yaml`.
The production hosts are kept in an inventory at :git_file:`inventory/base/hosts.yaml`. In CI, the inventory is generated by Zuul (as it is allocating ephemeral nodes from the testing pool).
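For readers new to Ansible inventories, the relationship between hosts and groups can be sketched in the standard Ansible YAML inventory format. This is an assumption-laden illustration: the host name, address and group are placeholders, and the repository's own inventory files may be laid out differently.

```yaml
# Standard Ansible YAML inventory; all names and addresses are placeholders.
all:
  hosts:
    example01.example.test:
      ansible_host: 203.0.113.10   # documentation-range address
  children:
    # A group that a playbook can target with "hosts: example".
    example:
      hosts:
        example01.example.test:
```

A playbook that targets the ``example`` group then runs on every host listed under it, whether that host is a production server or an ephemeral CI node given a matching name.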
Public production and testing variables are kept under :git_file:`inventory/`. The one difference between CI and production is *secrets* such as API keys, tokens and passwords; in production the *nested* Ansible will populate these variables for the deployment directly from values stored on the bastion host. In CI, dummy values should be populated into the templates under :git_file:`playbooks/zuul/templates/`.
Production secrets are currently managed manually by OpenDev administrators on the bastion host.
Deployment
After review and approval of a change, Zuul will perform final gate testing and merge the change on your behalf.
Just as uploading a new change triggers Zuul to run CI tests in the check pipeline, and approving a change triggers Zuul to run gate tests and merge in the gate pipeline, the merge of a change triggers Zuul to run the deployment jobs in the deploy pipeline.
These jobs are named ``infra-prod-<service>`` and run the same playbooks and roles as in the CI system, except against the production services. Zuul will deploy the merged changes to the bastion host, and then trigger the bastion host to run a nested Ansible deployment against the production host.
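The way the three pipelines relate to the two families of jobs can be sketched with a Zuul project stanza. The pipeline names come from the description above; the job names are hypothetical placeholders, and the real project configuration will list many more jobs and options.

```yaml
# Hypothetical project stanza; job names are placeholders.
- project:
    check:
      jobs:
        - system-config-run-example      # CI on every uploaded change
    gate:
      jobs:
        - system-config-run-example      # re-tested before merging
    deploy:
      jobs:
        - infra-prod-example             # runs after the change merges
```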
Since the production run logs may leak sensitive information, they are not published openly. You can add a GPG public key to :git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and then ensure the ``infra-prod-<service>`` production job has your name in its ``encrypt_logs_job_recipients`` variable. Once approved and committed, you will then be able to view the encrypted production log output provided via the Zuul build page for the production run.
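A job variant that sets this variable might look roughly like the sketch below. The job name and recipient are placeholders, and the assumption that the variable holds a list of names matching entries in the role defaults should be checked against the repository before use.

```yaml
# Hypothetical sketch; the job name and recipient are placeholders.
- job:
    name: infra-prod-example
    vars:
      # Assumed to be a list of names matching keys defined in
      # playbooks/zuul/roles/encrypt-logs/defaults/main.yaml.
      encrypt_logs_job_recipients:
        - your_name
```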
Containers
Most services are containerised. When looking at the ``system-config-run-*`` and ``infra-prod-*`` jobs you may see dependencies on container build/upload/promote jobs; this indicates we have jobs that build a bespoke container for this environment.
The base ``Dockerfile`` for these containers is found under :git_file:`docker/`. Most are straightforward, but some of the more complicated services have multiple steps and layers. Any changes to the ``Dockerfile`` will be tested as usual, and when approved the containers will be rebuilt, published and pulled onto the production systems automatically.
Certificates
We provision SSL certificates from Let's Encrypt; see letsencrypt.
DNS
DNS for opendev.org (and some other domains) is also handled through the review system; see the https://opendev.org/opendev/zone-opendev.org/ project.
Backups
Any host in the backup group will have backups to two geographically distinct locations set up by the deployment infrastructure. See the borg-backup role for details on including or excluding various data.
Remote access
Hosts are only configured by Ansible, but they can be set up for interactive access if required.
Add your public key to :git_file:`inventory/base/group_vars/all.yaml` and include a stanza like this in your server ``host_vars``:
    extra_users:
      - your_user_name
See ssh-access for details on keys.
Documentation
Each service should have an RST file with documentation about the server and services in :git_file:`doc/source/`.
Submitting Changes
If you are not familiar with submitting changes to Gerrit, you can start with any of the various developer guides, such as:
https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
https://docs.opendev.org/opendev/infra-manual/latest/developers.html
The change description is very important and is the major source of historical information. A developer should be able to read the description of a change and have enough context to understand why it was introduced. Comments in the code-review system are useful for understanding the deeper history of each change, but each change should stand alone once committed. Only the most trivial, completely self-evident changes (e.g. typo fixes) would be expected to have fewer than a few sentences of context in their change log.
Lifecycle
We welcome all changes and contributions to the project.
Before starting work to deploy a new service that will require resources, you should do some preparation work. Putting an item on the weekly team meeting agenda is always welcome. Logs of previous meetings can be seen at https://meetings.opendev.org/#OpenDev_Meeting. More complicated changes may justify going through the spec process; see https://opendev.org/opendev/infra-specs. If the existing admins are aware of the details before reviews start appearing, the process is much smoother.
All preliminary work can be done in an iterative fashion using the CI jobs at your own pace. The ``#opendev`` IRC channel on OFTC is a good place to find help during this process. Alternatively, questions are welcome on the service-discuss list.
Your change (or changes) will be reviewed and may take a few rounds of revision before final approval (in Gerrit terms, a ``+2`` vote).
Most changes will receive a few ``-1`` votes from reviewers during development. This is really just a flag to note that some further discussion is required; it is not a rejection.
You can set ``Workflow`` to ``-1`` in Gerrit on changes you are still working on. Some developers instead put ``[WIP]`` at the front of their change description to tell reviewers not to spend much time on the change yet. Small, stand-alone sequential changes are encouraged, and Zuul makes testing such "stacks" of changes trivial.
We currently have admins manually deploy production virtual machines, storage attached to those machines and secrets to the bastion host. This will need to happen before changes are put into production. Discussion with the admins will help decide on the cloud provider, the VM size and storage, and other such matters.
Once resources are allocated and the new host is available in the inventory, the production jobs can deploy. After this the service moves into a maintenance phase; changes can be proposed and, after review, deployed.