
:title: Open Infrastructure Technical Overview

.. _opendev-infra-overview:

Open Infrastructure Technical Overview
######################################

The OpenDev system administration team strives to run the services
behind the OpenDev Collaboratory as an open source project; we term
this *open infrastructure*.

Our infrastructure is code and contributions to it are handled just
like the rest of OpenDev.  This means that anyone can contribute to
the installation and long-running maintenance of systems without shell
access, and anyone who is interested can provide feedback and
collaborate on code reviews.  There are no permissions or special
privileges required to contribute to the OpenDev infrastructure
project.

Below is a short guide to the major pieces of the project.  Some
knowledge of Zuul job configuration, Ansible, interaction with the
Gerrit code-review system and general Linux administration is
assumed; however, expertise is not required.

Operating environment
---------------------

The OpenDev production systems run in resources (compute, network,
storage) provided by donations from companies who support the project.

Our standard production system is based on the latest Ubuntu LTS
release.

Production systems are deployed by Ansible.  Most production
applications run from containers; some are custom built and others we
use unmodified from upstream sources.

Zuul handles the testing and deployment of all changes.  Current
trends would refer to this as a *gitops* model -- all production
changes are ultimately driven by a change proposed to the code-review
system.  This means we do not have bespoke production systems and any
modifications we make are reviewed by peers and logged with change
history.

We have a *bastion host*, or *bridge*, which is a static host with
permissions to deploy to the production systems.  Zuul will run
Ansible on the production systems via this host to deploy new changes
into production.

Getting started - CI
--------------------

The configuration of every system operated by the OpenDev sysadmins is
managed by Ansible and driven by continuous integration and deployment
by Zuul.  This is almost exclusively driven by code kept in the
``system-config`` repository, which can be browsed at:

  https://opendev.org/opendev/system-config

All system configuration should be encoded in that repository so that
anyone may propose a change in the running configuration to Gerrit.

Any change to the OpenDev infrastructure system is first proposed as a
review to this repository at ``review.opendev.org``.  The current open
reviews can be seen at

  https://review.opendev.org/q/project:opendev/system-config

Zuul will first run CI on all incoming changes.  Each service
generally has its own CI job that runs when relevant files
(configuration, Ansible roles, playbooks, etc.) are updated.  These
are generally called ``system-config-run-<service>``; Zuul will post a
comment when the change has been tested, or you can see in-flight
testing at the status page

  https://zuul.opendev.org/t/openstack/status

These jobs are crafted in a way that they replicate production as much
as possible.  Reading the job definitions in
:git_file:`zuul.d/system-config-run.yaml` will give you a feel for the
hosts that are set up with each job.  When you view the job results in
the Zuul UI, you will see many logs collected from a number of hosts
that simulate the production environment.  This has all the
information you generally need to debug problems, but the best place
to start is with the *artifacts* tab, which has some curated links to
useful overviews.

One of the job artifacts is the `ARA report
<https://ara.readthedocs.io/en/latest/>`__.  This is a graphical view
of the *nested* Ansible run on the (ephemeral) bastion host against
the (ephemeral) production-test nodes.  This is generally the first
stop for finding deployment issues.

Another artifact is the ``testinfra results``.  `Testinfra
<https://testinfra.readthedocs.io>`__ allows us to define
unit-test-like behaviour to test functionality such as service and API
status, correct deployment of users and files and other interesting
details.  Failures here would indicate that the deployment steps
worked, but some part of the operation of that system is not as we
expect.  The ``testinfra`` code driving this is kept in
:git_file:`testinfra` and test files are named for the service they
test.

Finally there is a ``screenshots`` artifact, which is a link to a
directory that some tests populate with image files.  Tests that are
bringing up interactive services will use a headless browser to take
shots of important pages to verify correct operation.

The logs tab has links to the raw logs; these collect much more
detail such as ``syslog``, Apache logs, database dumps, etc.  Once you
have identified the general problem from the above steps, these logs
provide the in-depth details for further analysis.

Playbooks and roles
-------------------

The starting point for all services is generally the playbooks and
roles kept in :git_file:`playbooks/`.  Most playbooks are named
``service-<name>.yaml`` and will indicate from their naming which
production areas they drive.

During testing, these same playbooks are run against the test nodes.
Note that the testing hosts are given names that match the
group configuration in the jobs defined in
:git_file:`zuul.d/system-config-run.yaml`.
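
As a rough illustration, a test job of this kind might look like the
sketch below.  The service name ``example``, the host names, the node
label and the file matchers are all hypothetical, and the parent job
name is an assumption; consult the real definitions in
:git_file:`zuul.d/system-config-run.yaml` before copying anything::

    # Hypothetical job; not taken from the repository.
    - job:
        name: system-config-run-example
        # Assumed name of a shared base job.
        parent: system-config-run
        description: |
          Run the playbook for the example service.
        nodeset:
          nodes:
            # Ephemeral stand-in for the bastion host.
            - name: bridge99.opendev.org
              label: ubuntu-noble
            # Named so that it matches the service's inventory group.
            - name: example99.opendev.org
              label: ubuntu-noble
        # Only run when files relevant to the service change.
        files:
          - playbooks/service-example.yaml
          - playbooks/roles/example/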

These playbooks are usually small and they call out to roles where
most of the work is done.  Roles are kept in
:git_file:`playbooks/roles/`.  These roles are written to be as
generic as possible, but they are not expected to be used outside the
OpenDev production deployment system.
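
A service playbook might be as short as the following sketch.  The
``example`` group and role are hypothetical names used only for
illustration; the real playbooks in :git_file:`playbooks/` are the
authoritative reference::

    # Hypothetical playbooks/service-example.yaml
    - hosts: example
      name: "Configure the example service"
      roles:
        # Hypothetical role that would live in playbooks/roles/example/
        - example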

These playbooks and roles are the same for CI and deployment.

Hosts and variables
-------------------

The playbooks above run on groups of hosts which are defined in
:git_file:`inventory/service/groups.yaml`.
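
A group definition might look something like the following sketch.
The ``example`` group and its host pattern are hypothetical, and the
exact layout and pattern syntax are assumptions; the file itself is
the authoritative reference::

    # Hypothetical excerpt in the style of inventory/service/groups.yaml
    groups:
      example:
        # Pattern intended to match hosts such as example01.opendev.org
        - example[0-9]*.opendev.org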

The production hosts are kept in an inventory at
:git_file:`inventory/base/hosts.yaml`.  In CI, the inventory is
generated by Zuul (as it is allocating ephemeral nodes from the
testing pool).

Public production and testing variables are kept under
:git_file:`inventory/`.  The one difference between CI and production
is *secrets* such as API keys, tokens and passwords; in production the
*nested* Ansible will populate these variables for the deployment
directly from values stored on the bastion host.  In CI, dummy values
should be populated into the templates under
:git_file:`playbooks/zuul/templates/`.
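
For example, a CI template supplying dummy secrets for a service could
look like this sketch.  The file name and variable names are
hypothetical; the only point is that the values are obviously fake and
safe to publish::

    # Hypothetical template under playbooks/zuul/templates/
    # (file and variable names are illustrative only).
    example_api_token: not-a-real-token
    example_db_password: dummy-password-for-ci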

Production secrets are currently managed manually by OpenDev
administrators on the bastion host.

Deployment
----------

After review and approval of a change, Zuul will perform final gate
testing and merge the change on your behalf.

Just as uploading a new change triggers Zuul to run CI tests in the
*check* pipeline, and approving a change triggers Zuul to run gate
tests and merge in the *gate* pipeline, the merge of a change triggers
Zuul to run the deployment jobs in the *deploy* pipeline.

These jobs are named ``infra-prod-<service>`` and run the same
playbooks and roles as in the CI system, except against the production
services.  Zuul will deploy the merged changes to the bastion host,
and then trigger the bastion host to run a *nested* Ansible deployment
against the production host.
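
The relationship between the three pipelines can be sketched as a Zuul
project stanza such as the one below.  The job names are hypothetical
instances of the naming patterns described in this document, and the
real project configuration in the repository will differ in detail::

    # Hypothetical project configuration; job names are illustrative.
    - project:
        check:
          jobs:
            # Runs when a change is uploaded.
            - system-config-run-example
        gate:
          jobs:
            # Runs again after approval, before the merge.
            - system-config-run-example
        deploy:
          jobs:
            # Runs against production once the change has merged.
            - infra-prod-example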

Since the production run logs may leak sensitive information, they are
not published openly.  You can add a GPG public key to
:git_file:`playbooks/zuul/roles/encrypt-logs/defaults/main.yaml` and
then ensure the ``infra-prod-<service>`` production job has your name in
its ``encrypt_logs_job_recipients`` variable.  Once approved and
committed, you will then be able to view the encrypted production log
output provided via the Zuul build page for the production run.
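
On the job side, this might look like the sketch below.  The job name
and recipient name are hypothetical, and treating the variable as a
list of names is an assumption; copy the format of an existing key
entry and an existing job rather than relying on this example::

    # Hypothetical job excerpt; names are illustrative only.
    - job:
        name: infra-prod-example
        vars:
          encrypt_logs_job_recipients:
            # Assumed to match the name given alongside your GPG key.
            - your_name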

Containers
----------

Most services are containerised.  When looking at the
``system-config-run-*`` and ``infra-prod-*`` jobs you may see dependencies
on container build/upload/promote jobs; this indicates we have jobs
that build a bespoke container for this environment.

The base ``Dockerfile`` for these containers is found under
:git_file:`docker/`.  Most are straightforward, but some of the more
complicated services have multiple steps and layers.  Any changes to
the ``Dockerfile`` will be tested as usual, and when approved the
containers will be rebuilt, published and pulled onto the production
systems automatically.

Certificates
------------

We provision SSL certificates from Let's Encrypt; see
:ref:`letsencrypt`.

DNS
---

DNS for ``opendev.org`` (and some other domains) is also handled through
the review system; see the
`<https://opendev.org/opendev/zone-opendev.org/>`__ project.

Backups
-------

Any host in the ``backup`` group will have backups to two
geographically distinct locations set up by the deployment
infrastructure.  See the ``borg-backup`` role for details on including
or excluding various data.

Remote access
-------------

Hosts are only configured by Ansible, but they can be set up for
interactive access if required.

Add your public key to :git_file:`inventory/base/group_vars/all.yaml`
and include a stanza like this in your server ``host_vars``::

    extra_users:
      - your_user_name

See :ref:`ssh-access` for details on keys.

Documentation
-------------

Each service should have an RST file with documentation about the
server and services in :git_file:`doc/source/`.

Submitting Changes
------------------

If you are not familiar with submitting changes to Gerrit, you can
start with any of the various developer guides such as ::

  https://docs.opendev.org/opendev/infra-manual/latest/gettingstarted.html
  https://docs.openstack.org/doc-contrib-guide/quickstart/first-timers.html
  https://docs.opendev.org/opendev/infra-manual/latest/developers.html

The change description is very important and the major source of
historical information.  It is expected a developer can read the
description of a change and have the context to generally understand
why it was introduced.  Comments in the code-review system are useful
to understand the deeper history of each change, but each change
should stand alone once committed.  Only the most trivial of changes
that are completely self-evident (e.g. typo fixes) would be expected
to have fewer than a few sentences of context in their change log.

Lifecycle
---------

We welcome all changes and contributions to the project.

Before starting work to deploy a new service that will require
resources, you should do some preparation work.  Putting an item on
the `weekly team meeting agenda
<https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting>`__
is always welcome.  Logs of previous meetings can be seen at
`<https://meetings.opendev.org/#OpenDev_Meeting>`__.  More complicated
changes may justify going through the spec process; see
`<https://opendev.org/opendev/infra-specs>`__.  If the existing admins
are aware of the details before reviews start appearing, the process
is much smoother.

All preliminary work can be done in an iterative fashion using the CI
jobs at your own pace.  The ``#opendev`` IRC channel on ``OFTC`` is a
good place to find help during this process.  Alternatively, questions
are welcome on the `service-discuss list
<http://lists.opendev.org/cgi-bin/mailman/listinfo/service-discuss>`__.

Your change (or changes) will be reviewed and may take a few rounds
before final approval (in Gerrit terms, a ``+2`` vote).  Most changes
will receive a few ``-1`` votes from reviewers during development.
This is really just a flag to note that some further discussion is
required; it is not a rejection.

You can set ``Workflow`` to ``-1`` in Gerrit on changes you are
working on, or some developers like to put ``[WIP]`` at the front of
their change description to indicate to reviewers they probably
shouldn't spend much time on this yet, as you are still working on it.
Small, stand-alone sequential changes are encouraged, and Zuul makes
testing such "stacks" of changes trivial.

We currently have admins manually deploy production virtual machines,
storage attached to those machines and secrets to the bastion host.
This will need to happen before changes are put into production.
Discussion with the admins will help decide on which cloud provider,
the VM storage/size and other such matters.

Once resources are allocated and the new host is available in the
inventory, the production jobs can deploy.  After this the service
moves into a maintenance phase; changes can be proposed and, after
review, deployed.