:title: Open Infrastructure Technical Overview

.. _opendev-infra-overview:

Open Infrastructure Technical Overview
######################################

The OpenDev system administration team strives to run the services
behind the OpenDev Collaboratory as an open source project; we term
this *open infrastructure*.

Our infrastructure is code and contributions to it are handled just
like the rest of OpenDev. This means that anyone can contribute to
the installation and long-running maintenance of systems without shell
access, and anyone who is interested can provide feedback and
collaborate on code reviews. There are no permissions or special
privileges required to contribute to the OpenDev infrastructure
project.

Below is a short guide to the major pieces of the project. Some
knowledge of Zuul job configuration, Ansible, interaction with the
Gerrit code-review system and general Linux administration is
assumed; however, expertise is not required.

Operating environment
---------------------

The OpenDev production systems run in resources (compute, network,
storage) provided by donations from companies who support the project.

Our standard production system is based on the latest Ubuntu LTS
release.

Production systems are deployed by Ansible. Most production
applications run from containers; some are custom built and others we
use unmodified from upstream sources.

Zuul handles the testing and deployment of all changes. Current
trends would refer to this as a *gitops* model -- all production
changes are ultimately driven by a change proposed to the code-review
system. This means we do not have bespoke production systems and any
modifications we make are reviewed by peers and logged with change
history.

We have a *bastion host*, or *bridge*, which is a static host with
permissions to deploy to the production systems. Zuul will run
Ansible on the production systems via this host to deploy new changes
into production.

Getting started - CI
--------------------

The configuration of every system operated by the OpenDev sysadmins is
managed by Ansible and driven by continuous integration and deployment
by Zuul. This is almost exclusively driven by code kept in the
``system-config`` repository, which can be browsed at:

https://opendev.org/opendev/system-config

All system configuration should be encoded in that repository so that
anyone may propose a change in the running configuration to Gerrit.

Any change to the OpenDev infrastructure system is first proposed as a
review to this repository at ``review.opendev.org``. The current open
reviews can be seen at

https://review.opendev.org/q/project:opendev/system-config

Zuul will first run CI on all incoming changes. Each service
generally has its own CI job that runs when relevant files
(configuration, Ansible roles, playbooks, etc.) are updated. These
are generally called ``system-config-run-<service>``; Zuul will post a
comment when the change has been tested, or you can see in-flight
testing at the status page

https://zuul.opendev.org/t/openstack/status

These jobs are crafted so that they replicate production as much
as possible. Reading the job definitions in
:git_file:`zuul.d/system-config-run.yaml` will give you a feel for the
hosts that are set up with each job. When you view the job results in
the Zuul UI, you will see many logs collected from a number of hosts
that simulate the production environment. This has all the
information you generally need to debug problems, but the best place
to start is with the *artifacts* tab, which has some curated links to
useful overviews.

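As a rough illustration of the job pattern described above, a
per-service CI job might look something like the following sketch.
The service name ``example``, the parent job name, the host names, the
node label and the ``run_playbooks`` variable are all assumptions made
for illustration; they are not copied from
:git_file:`zuul.d/system-config-run.yaml`, which remains the
authoritative reference::

    # Hypothetical sketch only; names, labels and variables are assumed.
    - job:
        name: system-config-run-example
        parent: system-config-run
        description: Run the example service playbook on test nodes.
        nodeset:
          nodes:
            # An ephemeral stand-in for the bastion host
            - name: bridge99.opendev.org
              label: ubuntu-jammy
            # An ephemeral stand-in for the production service host
            - name: example99.opendev.org
              label: ubuntu-jammy
        vars:
          run_playbooks:
            - playbooks/service-example.yaml
        # Only run when files relevant to this service change
        files:
          - playbooks/service-example.yaml
          - playbooks/roles/example/
          - testinfra/test_example.py

In a job shaped like this, the ``files`` matcher is what would limit
the job to changes that touch the service in question.
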
One of the job artifacts is the `ARA report
<https://ara.readthedocs.io/en/latest/>`__. This is a graphical view
of the *nested* Ansible run on the (ephemeral) bastion host against
the (ephemeral) production-test nodes. This is generally the first
stop for finding deployment issues.

Another artifact is the ``testinfra results``. `Testinfra
<https://testinfra.readthedocs.io>`__ allows us to define
unit-test-like behaviour to test functionality such as service and API
status, correct deployment of users and files and other interesting
details. Failures here would indicate that the deployment steps
worked, but some part of the operation of that system is not as we
expect. The ``testinfra`` code driving this is kept in
:git_file:`testinfra` and test files are named for the service they
test.

Finally there is a ``screenshots`` artifact, which is a link to a
directory that some tests populate with image files. Tests that are
bringing up interactive services will use a headless browser to take
shots of important pages to verify correct operation.

The logs tab has links to the raw logs; this collects much more
detail such as ``syslog``, Apache logs, database dumps, etc. Once you
have identified the general problem from the above steps, these logs
provide the in-depth details for further analysis.

Playbooks and roles
-------------------

The starting point for all services is generally the playbooks and
roles kept in :git_file:`playbooks/`. Most playbooks are named
``service-<name>.yaml`` and will indicate from their naming which
production areas they drive.

During testing, these same playbooks are run against the test nodes.
Note that the testing hosts are given names that match the
group configuration in the jobs defined in
:git_file:`zuul.d/system-config-run.yaml`.

These playbooks are usually small and they call out to roles where
most of the work is done. Roles are kept in
:git_file:`playbooks/roles/`. These roles are written to be as
generic as possible, but they are not expected to be used outside the
OpenDev production deployment system.

These playbooks and roles are the same for CI and deployment.

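To make the "small playbook, larger role" split concrete, here is a
minimal sketch of what a service playbook could look like. The file
name, the ``example`` group and the ``example`` role are hypothetical
and do not correspond to a real service in the repository::

    # Hypothetical playbooks/service-example.yaml; all names are assumed.
    - hosts: example
      name: "Configure the example service"
      roles:
        # The role would live under playbooks/roles/example/ and do
        # most of the actual work (packages, config files, containers).
        - example

Because the play targets a group rather than a specific host, the same
file can be run unchanged against test nodes in CI and against the
production hosts.
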
Hosts and variables
-------------------

The playbooks above run on groups of hosts which are defined in
:git_file:`inventory/service/groups.yaml`.

The production hosts are kept in an inventory at
:git_file:`inventory/base/hosts.yaml`. In CI, the inventory is
generated by Zuul (as it is allocating ephemeral nodes from the
testing pool).

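As an illustration of how a group could tie production and test hosts
together, the sketch below maps a group name to a hostname pattern.
The exact layout of :git_file:`inventory/service/groups.yaml` is
assumed here rather than quoted, and the ``example`` group and the
pattern are hypothetical; check the real file before relying on this
shape::

    # Hypothetical group definition; the file layout is an assumption.
    groups:
      example:
        # Would match both a production host such as
        # example01.opendev.org and a CI node named
        # example99.opendev.org
        - example[0-9]*.opendev.org

Under that assumption, naming a CI node to match the pattern is enough
for the same playbooks to apply to it.
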
Public production and testing variables are kept under
:git_file:`inventory/`. The one difference between CI and production
is *secrets* such as API keys, tokens and passwords; in production the
*nested* Ansible will populate these variables for the deployment
directly from values stored on the bastion host. In CI, dummy values
should be populated into the templates under
:git_file:`playbooks/zuul/templates/`.

Production secrets are currently managed manually by OpenDev
administrators on the bastion host.

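For example, the dummy values supplied for CI might look something
like the following. The variable names and values are placeholders
invented for this sketch, and the precise file they belong in under
:git_file:`playbooks/zuul/templates/` is not specified here::

    # Hypothetical CI-only variables; names and values are placeholders.
    # Real secrets for these variables would exist only on the bastion
    # host and never appear in the repository.
    example_db_password: testing-only-password
    example_api_token: testing-only-token

The point of the pattern is that roles can reference the same variable
names in both environments, with only the source of the values
differing.
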
Deployment
----------

After review and approval of a change, Zuul will perform final gate
testing and merge the change on your behalf.

Just as uploading a new change triggers Zuul to run CI tests in the
*check* pipeline, and approving a change triggers Zuul to run gate
tests and merge in the *gate* pipeline, the merge of a change triggers
Zuul to run the deployment jobs in the *deploy* pipeline.

These jobs are named ``infra-prod-<service>`` and run the same
playbooks and roles as in the CI system, except against the production
services. Zuul will deploy the merged changes to the bastion host,
and then trigger the bastion host to run a *nested* Ansible deployment
against the production hosts.

Since the production run logs may leak sensitive information, they are
not published openly. You can add a GPG public key to
:git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and
then ensure the ``infra-prod-<service>`` production job has your name
in its ``encrypt_logs_job_recipients`` variable. Once the change is
approved and merged, you will be able to view the encrypted production
log output provided via the Zuul build page for the production run.

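A production job that lists a log recipient might be sketched as
follows. Only the ``infra-prod-<service>`` naming and the
``encrypt_logs_job_recipients`` variable come from the description
above; the parent job name, the ``playbook_name`` variable and the
recipient name are assumptions for illustration::

    # Hypothetical sketch only; parent and playbook_name are assumed.
    - job:
        name: infra-prod-service-example
        parent: infra-prod-service-base
        description: Run the example service playbook in production.
        vars:
          playbook_name: service-example.yaml
          # Names listed here should match keys added to the
          # encrypt-logs role defaults
          encrypt_logs_job_recipients:
            - your_name
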
Containers
----------

Most services are containerised. When looking at the
``system-config-run-*`` and ``infra-prod-*`` jobs you may see dependencies
on container build/upload/promote jobs; this indicates we have jobs
that build a bespoke container for this environment.

The base ``Dockerfile`` for these containers is found under
:git_file:`docker/`. Most are straightforward, but some of the more
complicated services have multiple steps and layers. Any changes to
the ``Dockerfile`` will be tested as usual, and when approved the
containers will be rebuilt, published and pulled onto the production
systems automatically.

Certificates
------------

We provision SSL certificates from LetsEncrypt; see
:ref:`letsencrypt`.

DNS
---

DNS for ``opendev.org`` (and some other domains) is also handled through
the review system; see the
`<https://opendev.org/opendev/zone-opendev.org/>`__ project.

Backups
-------

Any host in the ``backup`` group will have backups to two
geographically distinct locations set up by the deployment
infrastructure. See the ``borg-backup`` role for details on including
or excluding various data.

Remote access
-------------

Hosts are only configured by Ansible, but they can be set up for
interactive access if required.

Add your public key to :git_file:`inventory/base/group_vars/all.yaml`
and include a stanza like this in your server ``host_vars``::

    extra_users:
      - your_user_name

See :ref:`ssh-access` for details on keys.

Documentation
-------------

Each service should have an RST file with documentation about the
server and services in :git_file:`doc/source/`.

Submitting Changes
------------------

If you are not familiar with submitting changes to Gerrit, you can
start with any of the various developer guides such as ::

    https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
    https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
    https://docs.opendev.org/opendev/infra-manual/latest/developers.html

The change description is very important and is the major source of
historical information. A developer should be able to read the
description of a change and have the context to generally understand
why it was introduced. Comments in the code-review system are useful
to understand the deeper history of each change, but each change
should stand alone once committed. Only the most trivial of changes
that are completely self-evident (e.g. typo fixes) would be expected
to have less than a few sentences of context in their change log.

Lifecycle
---------

We welcome all changes and contributions to the project.

Before starting work to deploy a new service that will require
resources, you should do some preparation work. Putting an item on
the `weekly team meeting agenda
<https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting>`__
is always welcome. Logs of previous meetings can be seen at
`<https://meetings.opendev.org/#OpenDev_Meeting>`__. More complicated
changes may justify going through the spec process; see
`<https://opendev.org/opendev/infra-specs>`__. If the existing admins
are aware of the details before reviews start appearing it makes the
process much smoother.

All preliminary work can be done in an iterative fashion using the CI
jobs at your own pace. The ``#opendev`` IRC channel on ``OFTC`` is a
good place to find help during this process. Alternatively, questions
are welcome on the `service-discuss list
<http://lists.opendev.org/cgi-bin/mailman/listinfo/service-discuss>`__.
Your change (or changes) will be reviewed and may take a few rounds
before final approval (in Gerrit terms, a ``+2`` vote). Most changes
will receive a few ``-1`` votes from reviewers during development.
This is really just a flag to note that some further discussion is
required; it is not a rejection.

You can set ``Workflow`` to ``-1`` in Gerrit on changes you are
still working on, or put ``[WIP]`` at the front of the change
description, to indicate to reviewers that they probably shouldn't
spend much time on the change yet.
Small, stand-alone sequential changes are encouraged, and Zuul makes
testing such "stacks" of changes trivial.

We currently have admins manually deploy production virtual machines,
storage attached to those machines and secrets to the bastion host.
This will need to happen before changes are put into production.
Discussion with the admins will help decide on the cloud provider,
the VM storage and size, and other such matters.

Once resources are allocated and the new host is available in the
inventory, the production jobs can deploy. After this the service
moves into a maintenance phase; changes can be proposed and, after
review, deployed.