system-config/doc/source/open-infrastructure.rst
Ian Wienand 4c86706e5e docs: reorganise around a open infrastructure overview
This introduces and "Open Infrastructure" page which is designed for a
moderately experienced developer with some understanding of Zuul,
Ansible and basic Linux admin skills to have an entrypoint to
navigating the system-config and related repositories.

It is designed to re-enforce the idea of open infrastructure, and
explain how development, testing and production come together at a
level high enough to be understood, but with links or descriptions of
specific places in the code to get started.

It moves a little of what was in the sysadmin page into this, and
leaves that page as more low-level descriptions of various tasks.

Change-Id: I60a9299df455b98ad549ac0075a59d381722bc06
2022-03-04 12:18:42 +11:00

Open Infrastructure Technical Overview

The OpenDev system administration team strives to run the services behind the OpenDev Collaboratory as an open source project; we term this open infrastructure.

Our infrastructure is code and contributions to it are handled just like the rest of OpenDev. This means that anyone can contribute to the installation and long-running maintenance of systems without shell access, and anyone who is interested can provide feedback and collaborate on code reviews. There are no permissions or special privileges required to contribute to the OpenDev infrastructure project.

Below is a short guide to the major pieces of the project. Some knowledge of Zuul job configuration, Ansible, interaction with the Gerrit code-review system and general Linux administration is assumed; however, expertise is not required.

Operating environment

The OpenDev production systems run in resources (compute, network, storage) provided by donations from companies who support the project.

Our standard production system is based on the latest Ubuntu LTS release.

Production systems are deployed by Ansible. Most production applications run from containers; some are custom built and others we use unmodified from upstream sources.

Zuul handles the testing and deployment of all changes. Current trends would refer to this as a gitops model: all production changes are ultimately driven by a change proposed to the code-review system. This means we do not have bespoke production systems, and any modifications we make are reviewed by peers and logged with change history.

We have a bastion host, or bridge, which is a static host with permissions to deploy to the production systems. Zuul will run Ansible on the production systems via this host to deploy new changes into production.

Getting started - CI

The configuration of every system operated by the OpenDev sysadmins is managed by Ansible and driven by continuous integration and deployment by Zuul. This is almost exclusively driven by code kept in the system-config repository, which can be browsed at:

https://opendev.org/opendev/system-config

All system configuration should be encoded in that repository so that anyone may propose a change in the running configuration to Gerrit.

Any change to the OpenDev infrastructure system is first proposed as a review to this repository at review.opendev.org. The current open reviews can be seen at

https://review.opendev.org/q/project:opendev/system-config

Zuul will first run CI on all incoming changes. Each service generally has its own CI job that runs when relevant files (configuration, Ansible roles, playbooks, etc.) are updated. These are generally called system-config-run-<service>; Zuul will post a comment when the change has been tested, or you can see in-flight testing at the status page

https://zuul.opendev.org/t/openstack/status

These jobs are crafted to replicate production as closely as possible. Reading the job definitions in :git_file:`zuul.d/system-config-run.yaml` will give you a feel for the hosts that are set up with each job. When you view the job results in the Zuul UI, you will see many logs collected from a number of hosts that simulate the production environment. This has all the information you generally need to debug problems, but the best place to start is with the artifacts tab, which has some curated links to useful overviews.
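
As a rough illustration of what such a definition looks like, here is a minimal sketch of a job for a hypothetical service called example. The parent job, host names, node labels, file matchers and the run_playbooks variable are all assumptions made for illustration; consult the real definitions in the file named above for the actual values.

.. code-block:: yaml

   # Hypothetical sketch; names and labels are illustrative only.
   - job:
       name: system-config-run-example
       parent: system-config-run
       description: Run the playbook for the example servers.
       nodeset:
         nodes:
           # Ephemeral stand-in for the bastion host.
           - name: bridge99.opendev.org
             label: ubuntu-jammy
           # Ephemeral stand-in for a production example server.
           - name: example99.opendev.org
             label: ubuntu-jammy
       vars:
         # Assumed variable naming the playbooks run by the nested Ansible.
         run_playbooks:
           - playbooks/service-example.yaml
       # The job only runs when a change touches one of these paths.
       files:
         - playbooks/service-example.yaml
         - playbooks/roles/example/
         - testinfra/test_example.py

The files matcher is what limits the job to changes that touch the relevant configuration, roles or tests.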

One of the job artifacts is the ARA report. This is a graphical view of the nested Ansible run on the (ephemeral) bastion host against the (ephemeral) production-test nodes. This is generally the first stop for finding deployment issues.

Another artifact is the testinfra results. Testinfra allows us to define unit-test-like behaviour to test functionality such as service and API status, correct deployment of users and files, and other interesting details. Failures here would indicate that the deployment steps worked, but some part of the operation of that system is not as we expect. The testinfra code driving this is kept in :git_file:`testinfra` and test files are named for the service they test.

Finally there is a screenshots artifact, which is a link to a directory that some tests populate with image files. Tests that are bringing up interactive services will use a headless browser to take shots of important pages to verify correct operation.

The logs tab has links to the raw logs; this collects much more detail such as syslog, Apache logs, database dumps, etc. Once you have identified the general problem from the above steps, these logs provide the in-depth details for further analysis.

Playbooks and roles

The starting point for all services is generally the playbooks and roles kept in :git_file:`playbooks/`. Most playbooks are named service-<name>.yaml and will indicate from their naming which production areas they drive.

During testing, these same playbooks are run against the test nodes. Note that the jobs defined in :git_file:`zuul.d/system-config-run.yaml` give the testing hosts names that match the group configuration.

These playbooks are usually small and they call out to roles where most of the work is done. Roles are kept in :git_file:`playbooks/roles/`. These roles are written to be as generic as possible, but they are not expected to be used outside the OpenDev production deployment system.

These playbooks and roles are the same for CI and deployment.
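
A service playbook in this style might look like the following minimal sketch. The example group and role are hypothetical, and real playbooks may apply several roles to several groups.

.. code-block:: yaml

   # playbooks/service-example.yaml (hypothetical file)
   - hosts: example
     name: Configure the example service
     roles:
       # Most of the work happens in the role, kept under
       # playbooks/roles/example/ in this sketch.
       - example

Because the play targets a group rather than a named host, the same file can apply to both the CI test nodes and the production hosts.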

Hosts and variables

The playbooks above run on groups of hosts which are defined in :git_file:`inventory/service/groups.yaml`.
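
As an illustration only, a group definition could map a group name to a list of host-name patterns, as in the sketch below. The exact layout of the real file and the example group are assumptions; check the file itself before copying this.

.. code-block:: yaml

   # Hypothetical sketch of a group definition.
   groups:
     example:
       # A pattern like this would match production hosts and CI test
       # nodes alike, e.g. a test node named example99.opendev.org.
       - example[0-9]*.opendev.org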

The production hosts are kept in an inventory at :git_file:`inventory/base/hosts.yaml`. In CI, the inventory is generated by Zuul (as it is allocating ephemeral nodes from the testing pool).

Public production and testing variables are kept under :git_file:`inventory/`. The one difference between CI and production is *secrets* such as API keys, tokens and passwords; in production the *nested* Ansible will populate these variables for the deployment directly from values stored on the bastion host. In CI, dummy values should be populated into the templates under :git_file:`playbooks/zuul/templates/`.
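
For example, a CI template for a hypothetical example service might carry placeholder values such as the following. The variable names are invented for illustration; the real names are whatever the corresponding role expects.

.. code-block:: yaml

   # Hypothetical dummy secrets for CI only. They need to be well-formed
   # enough for the service to start, but are never used in production.
   example_api_token: not-a-real-token
   example_db_password: ci-dummy-password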

Production secrets are currently managed manually by OpenDev administrators on the bastion host.

Deployment

After review and approval of a change, Zuul will perform final gate testing and merge the change on your behalf.

Just as uploading a new change triggers Zuul to run CI tests in the check pipeline, and approving a change triggers Zuul to run gate tests and merge in the gate pipeline, the merge of a change triggers Zuul to run the deployment jobs in the deploy pipeline.

These jobs are named infra-prod-<service> and run the same playbooks and roles as in the CI system, except against the production services. Zuul will deploy the merged changes to the bastion host, and then trigger the bastion host to run a nested Ansible deployment against the production hosts.
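
The relationship between the three pipelines can be sketched as a Zuul project stanza. This is a simplified illustration for a hypothetical example service; the real project configuration lists many more jobs and may attach dependencies between them.

.. code-block:: yaml

   # Hypothetical sketch; job names follow the patterns described above.
   - project:
       # Runs when a change is uploaded.
       check:
         jobs:
           - system-config-run-example
       # Runs when a change is approved, before it merges.
       gate:
         jobs:
           - system-config-run-example
       # Runs after the change has merged.
       deploy:
         jobs:
           - infra-prod-example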

Since the production run logs may leak sensitive information, they are not published openly. You can add a GPG public key to :git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and then ensure the infra-prod-<service> production job has your name in its encrypt_logs_job_recipients variable. Once approved and committed, you will then be able to view the encrypted production log output provided via the Zuul build page for the production run.
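
A minimal sketch of the second step, assuming the variable is a list set in the job's vars, might look like this. The job name and user name are placeholders, and where exactly the variable is set in the real job definitions is an assumption.

.. code-block:: yaml

   # Hypothetical sketch; check the real job definition before editing.
   - job:
       name: infra-prod-example
       vars:
         encrypt_logs_job_recipients:
           # Should match the name given alongside your GPG public key.
           - your_user_name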

Containers

Most services are containerised. When looking at the system-config-run-* and infra-prod-* jobs you may see dependencies on container build/upload/promote jobs; this indicates we have jobs that build a bespoke container for this environment.

The base Dockerfile for these containers is found under :git_file:`docker/`. Most are straightforward, but some of the more complicated services have multiple steps and layers. Any changes to the Dockerfile will be tested as usual, and when approved the containers will be rebuilt, published and pulled onto the production systems automatically.

Certificates

We provision SSL certificates from Let's Encrypt; see letsencrypt.

DNS

DNS for opendev.org (and some other domains) is also handled through the review system; see the https://opendev.org/opendev/zone-opendev.org/ project.

Backups

Any host in the backup group will have backups to two geographically distinct locations set up by the deployment infrastructure. See the borg-backup role for details on including or excluding various data.

Remote access

Hosts are only configured by Ansible, but they can be set up for interactive access if required.

Add your public key to :git_file:`inventory/base/group_vars/all.yaml` and include a stanza like this in your server host_vars:

extra_users:
  - your_user_name

See ssh-access for details on keys.

Documentation

Each service should have an RST file with documentation about the server and services in :git_file:`doc/source/`.

Submitting Changes

If you are not familiar with submitting changes to Gerrit, you can start with any of the various developer guides such as:

https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
https://docs.opendev.org/opendev/infra-manual/latest/developers.html

The change description is very important and is the major source of historical information. It is expected that a developer can read the description of a change and have enough context to understand why it was introduced. Comments in the code-review system are useful to understand the deeper history of each change, but each change should stand alone once committed. Only the most trivial of changes that are completely self-evident (e.g. typo fixes) would be expected to have fewer than a few sentences of context in their change log.

Lifecycle

We welcome all changes and contributions to the project.

Before starting work to deploy a new service that will require resources, you should do some preparation work. Putting an item on the weekly team meeting agenda is always welcome. Logs of previous meetings can be seen at https://meetings.opendev.org/#OpenDev_Meeting. More complicated changes may justify going through the spec process; see https://opendev.org/opendev/infra-specs. If the existing admins are aware of the details before reviews start appearing, it makes the process much smoother.

All preliminary work can be done in an iterative fashion using the CI jobs at your own pace. The #opendev IRC channel on OFTC is a good place to find help during this process. Alternatively, questions are welcome on the service-discuss list. Your change (or changes) will be reviewed and may take a few rounds before final approval (in Gerrit terms, a +2 vote). Most changes will receive a few -1 votes from reviewers during development. This is really just a flag to note that some further discussion is required; it is not a rejection.

You can set Workflow to -1 in Gerrit on changes you are working on, or some developers like to put [WIP] at the front of their change description to indicate to reviewers they probably shouldn't spend much time on this yet, as you are still working on it. Small, stand-alone sequential changes are encouraged, and Zuul makes testing such "stacks" of changes trivial.

We currently have admins manually deploy production virtual machines, the storage attached to those machines, and secrets to the bastion host. This will need to happen before changes are put into production. Discussion with the admins will help decide on the cloud provider, the VM size and storage, and other such matters.

Once resources are allocated and the new host is available in the inventory, the production jobs can deploy. After this the service moves into a maintenance phase; changes can be proposed and, after review, deployed.