:title: OpenAFS

.. _openafs:

OpenAFS
#######

The Andrew Filesystem (or AFS) is a global distributed filesystem.
With a single mountpoint, clients can access any site on the Internet
which is running AFS as if it were a local filesystem.

OpenAFS is an open source implementation of the AFS services and
utilities.

A collection of AFS servers and volumes that are collectively
administered within a site is called a ``cell``.  The OpenStack
project runs the ``openstack.org`` AFS cell, accessible at
``/afs/openstack.org/``.

At a Glance
===========

:Hosts:
  * afsdb01.openstack.org (a vldb and pts server in DFW)
  * afsdb02.openstack.org (a vldb and pts server in ORD)
  * afsdb03.openstack.org (a second vldb and pts server in DFW)
  * afs01.dfw.openstack.org (a fileserver in DFW)
  * afs02.dfw.openstack.org (a second fileserver in DFW)
  * afs01.ord.openstack.org (a fileserver in ORD)
  * mirror-update.opendev.org (host running mirror update jobs)
:Ansible:
  * :git_file:`playbooks/service-afs.yaml`
  * :git_file:`playbooks/service-mirror.yaml`
  * :git_file:`playbooks/service-mirror-update.yaml`
:Projects:
  * http://openafs.org/
:Bugs:
  * http://bugs.launchpad.net/openstack-ci
  * http://rt.central.org/rt/Search/Results.html?Order=ASC&DefaultQueue=10&Query=Queue%20%3D%20%27openafs-bugs%27%20AND%20%28Status%20%3D%20%27open%27%20OR%20Status%20%3D%20%27new%27%29&Rows=50&OrderBy=id&Page=1&Format=&user=guest&pass=guest
:Resources:
  * `OpenAFS Documentation <http://docs.openafs.org/index.html>`_

OpenStack Cell
--------------

AFS may be one of the most thoroughly documented systems in the world.
There is plenty of very good information about how AFS works and the
commands to use it.  This document will only cover the minimum needed
to understand our deployment of it.

OpenStack runs an AFS cell called ``openstack.org``.  There are three
important services provided by a cell: the volume location database
(VLDB), the protection database (PTS), and the file server (FS).  The
volume location service answers queries from clients about which
fileservers should be contacted to access particular volumes, while
the protection service provides information about users and groups.

Our implementation follows the common recommendation to colocate the
VLDB and PTS servers, and so they both run on our afsdb* servers.
These servers all have the same information and communicate with each
other to keep in sync and automatically provide high-availability
service.  As described in
`<https://docs.openafs.org/AdminGuide/HDRWQ101.html>`__ the Ubik
protocol requires three servers to maintain availability; for that
reason, two of our DB servers are in the DFW region, and the other in
ORD.

Fileservers contain volumes, each of which is a portion of the file
space provided by that cell.  A volume appears as at least one
directory, but may contain directories within the volume.  Volumes are
mounted within other volumes to construct the filesystem hierarchy of
the cell.

OpenStack has two fileservers in DFW and one in ORD.  They do not
automatically contain copies of the same data.  A read-write volume in
AFS can only exist on exactly one fileserver, and if that fileserver
is out of service, the volumes it serves are not available.  However,
volumes may have read-only copies which are stored on other
fileservers.  If a client requests a read-only volume, as long as one
site with a read-only volume is online, it will be available.

Client Configuration
--------------------
.. _afs_client:

To use OpenAFS on a Debian or Ubuntu machine::

  sudo apt-get install openafs-client openafs-krb5 krb5-user

Debconf will ask you for a default realm, cell and cache size.
Answer::

  Default Kerberos version 5 realm: OPENSTACK.ORG
  AFS cell this workstation belongs to: openstack.org
  Size of AFS cache in kB: 500000

The default cache size in debconf is 50000 (50MB) which is not very
large.  We recommend setting it to 500000 (500MB -- add a zero to the
default debconf value), or whatever is appropriate for your system.

The OpenAFS client is not started by default, so you will need to
run::

  sudo service openafs-client start

When it's done, you should be able to ``cd /afs/openstack.org``.

Most of what is in our AFS cell does not require authentication.
However, if you have a principal in kerberos, you can get an
authentication token for use with AFS with::

  kinit
  aklog

If not running on Debian or Ubuntu you can install openafs client
packages as well as Kerberos5 packages on your distro of choice.

Then to kinit, use your fully qualified user id::

  kinit $USERNAME@OPENSTACK.ORG

Or for admin access::

  kinit $USERNAME/admin@OPENSTACK.ORG

Then aklog, specifying the openstack.org cell::

  aklog -cell openstack.org

Administration
--------------

The following information is relevant to AFS administrators.

All of these commands have excellent manpages which can be accessed
with commands like ``man vos`` or ``man vos create``.  They also
provide short help messages when run like ``vos -help`` or ``vos
create -help``.

For all administrative commands, you may either run them from any AFS
client machine while authenticated as an AFS admin, or locally without
authentication on an AFS server machine by appending the ``-localauth``
flag to the end of the command.

Adding a User
~~~~~~~~~~~~~
First, add a kerberos principal as described in :ref:`addprinc`.  Have the
username and UID from puppet ready.

Then add the user to the protection database with::

  pts createuser $USERNAME -id UID

Admin UIDs start at 1 and increment.  If you are adding a new admin
user, you must run ``pts listentries``, find the highest UID for an
admin user, increment it by one and use that as the UID.  The username
for an admin user should be in the form ``username.admin``.
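
Finding the next admin UID can be scripted.  The following is a minimal
sketch, assuming ``pts listentries`` prints the entry name in the first
column and its ID in the second; ``next_admin_uid`` is a hypothetical
helper name, not an OpenAFS command.

.. code-block:: shell

  # Hypothetical helper: read `pts listentries` output on stdin and
  # print the next unused admin UID (the highest ID among names ending
  # in ".admin", plus one).
  next_admin_uid() {
    awk '$1 ~ /\.admin$/ && $2 + 0 > max { max = $2 + 0 }
         END { print max + 1 }'
  }

  # Intended use on a host with AFS admin credentials:
  #   pts listentries | next_admin_uid

Check the result against the ``pts listentries`` output before passing
it to ``pts createuser``.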

.. note::
  Any '/' characters in a kerberos principal become '.' characters in
  AFS.
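
The mapping described in the note can be sketched as a small shell
function.  ``krb_to_afs`` is a hypothetical name used only for
illustration; it drops the realm and replaces each ``/`` with ``.``.

.. code-block:: shell

  # Hypothetical helper: map a Kerberos principal to its AFS user name.
  krb_to_afs() {
    # ${1%@*} strips a trailing @REALM if one is present.
    printf '%s\n' "${1%@*}" | tr / .
  }

  krb_to_afs "jdoe/admin@OPENSTACK.ORG"   # prints jdoe.admin
  krb_to_afs "service/foo-mirror"         # prints service.foo-mirror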

Adding a Superuser
~~~~~~~~~~~~~~~~~~
Run the following command to add an existing principal to AFS as a
superuser::

  pts adduser -user $USERNAME.admin -group system:administrators

After this, you should update the
:git_file:`playbooks/roles/openafs-server-config/files/UserList` file
to ensure the new username is authorized to issue privileged commands.

Deleting Files
~~~~~~~~~~~~~~

.. note::
  This is a basic example of write operations for AFS-hosted
  content, so applies more generally to manually adding or changing
  files as well. As we semi-regularly get requests to delete
  subtrees of documentation, this serves as a good demonstration.

First, as a prerequisite, make sure you've followed the `Client
Configuration`_ and `Adding a Superuser`_ steps for yourself and
that you know the password for your ``$USERNAME/admin`` kerberos
principal. Safely authenticate your superuser's principal in a new
PAG as follows::

  pagsh -c /bin/bash
  export KRB5CCNAME=FILE:`mktemp`
  kinit $USERNAME/admin
  aklog

If this is a potentially destructive change (perhaps you're worried
you might mistype a deletion and remove more content than you
intended) you can first create a copy-on-write backup snapshot like
so::

  vos backup docs

When deleting files, note that you should use the read-write
``/afs/.openstack.org`` path rather than the read-only
``/afs/openstack.org`` path, but normal Unix file manipulation
commands work as expected (do *not* use ``sudo`` for this)::

  rm -rf /afs/.openstack.org/docs/project-install-guide/baremetal/draft

If you don't want to have to wait for a volume release to happen (so
that your changes to the read-write filesystem are reflected
immediately in the read-only filesystem), you can release it now
too::

  vos release docs -verbose

Now you can clean up your session, destroy your ticket and exit the
temporary PAG thusly::

  unlog
  kdestroy
  exit

Creating a Volume
~~~~~~~~~~~~~~~~~

..
   See following for background on the issues
   http://meetings.opendev.org//irclogs/%23opendev/%23opendev.2020-06-10.log.html#t2020-06-10T22:13:43
   https://review.opendev.org/#/c/735061/

.. warning::

   You should *not* run these operations on the fileservers where the
   volumes or replicas are to be created (``afs01`` or ``afs02``).
   The OpenAFS ``vos`` tool resolves the IPv4 address of the fileserver
   host given on the command line.  If you are using the tool on the
   fileserver itself, the Debian and Ubuntu practice of mapping the
   hostname to ``127.0.1.1`` in ``/etc/hosts`` can result in the
   ``vos`` tool not correctly filtering the loopback address and
   setting the server address for the volume as ``127.0.1.1`` -- making
   it effectively
   inaccessible.  A similar problem can occur for NAT servers, if we
   were to use them.  Running on an external host means the lookups
   shouldn't return local addresses and avoids this issue.  The other
   option is to specify the fileservers as the IP address, rather than
   the hostname, to avoid any lookup issues.

Select a fileserver for the read-write copy of the volume according to
which region you wish to locate it after ensuring it has sufficient
free space.  Then run::

  vos create $FILESERVER a $VOLUMENAME

The ``a`` in the preceding command tells it to place the volume on
partition ``vicepa``.  Our fileservers only have one partition and therefore
this is a constant.
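
To compare free space between candidate fileservers, the output of
``vos partinfo`` can be reduced to one number per partition.  This is a
rough sketch that assumes the output lines look like ``Free space on
partition /vicepa: 830072660 K blocks out of total 1056763840`` (the
figures are made up); ``partinfo_free_kb`` is a hypothetical helper.

.. code-block:: shell

  # Hypothetical helper: read `vos partinfo` output on stdin and print
  # "<partition> <free kB>" for each partition line.
  partinfo_free_kb() {
    awk '/^Free space on partition/ { sub(":", "", $5); print $5, $6 }'
  }

  # Intended use:
  #   vos partinfo afs01.dfw.openstack.org | partinfo_free_kb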

Be sure to mount the read-write volume in AFS with::

  fs mkmount /afs/.openstack.org/path/to/mountpoint $VOLUMENAME

You may want to create read-only sites for the volume with ``vos
addsite`` and then ``vos release``.

If the volume's mountpoint lies within another volume, you may also
need to ``vos release`` that parent volume before it will show up in
the read-only path.

You should set the volume quota with ``fs setquota``.

Deleting a Volume
~~~~~~~~~~~~~~~~~

Remove the mountpoint(s) of the volume::

  fs rmmount /afs/.openstack.org/path/to/mountpoint

Be sure to release the parent volume (with ``vos release``) if
necessary after removing the mountpoint.

Run ``vos examine`` to see a list of volume sites.  Identify the
read-write and read-only sites.

Remove the read-only sites first; repeat this command for each one::

  vos remove -server $FILESERVER -partition $PARTITION -id $VOLUME.readonly

Remove the read-write volume::

  vos remove -id $VOLUME

Adding a Fileserver
~~~~~~~~~~~~~~~~~~~
Put the machine's public IP on a single line in
``/var/lib/openafs/local/NetInfo`` (TODO: puppet this).

Copy ``/etc/openafs/server/*`` from an existing fileserver.

Create an LVM volume named ``vicepa`` from cinder volumes.  See
:ref:`cinder` for details on volume management.  Then run::

  mkdir /vicepa
  echo "/dev/main/vicepa  /vicepa ext4  errors=remount-ro,barrier=0  0  2" >>/etc/fstab
  mount -a

Finally, create the fileserver with::

  bos create -server NEWSERVER -instance dafs -type dafs \
    -cmd "/usr/lib/openafs/dafileserver -L -p 242 -busyat 600 -rxpck 700 \
      -s 1200 -l 1200 -cb 2000000 -b 240 -vc 1200 \
      -udpsize 131071 -sendsize 131071" \
    -cmd /usr/lib/openafs/davolserver \
    -cmd /usr/lib/openafs/salvageserver \
    -cmd /usr/lib/openafs/dasalvager

It is worth evaluating these settings periodically:

* ``-L`` selects the large size, which ups a number of defaults
* ``-p`` defines the worker threads for processing incoming calls.
  Since they block until there is work to do, we should leave this at
  around the maximum (which may increase across versions; see
  documentation)
* ``-udpsize`` and ``-sendsize`` should be increased above their default
* ``-cb`` defines the callbacks.  For our use case, with a single
  mirror writer, this should be around the number of files the client
  is configured to cache (``-dcache``) multiplied by the number of
  clients.
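
As a worked example of the ``-cb`` estimate, the arithmetic can be done
in the shell.  Both input values below are illustrative assumptions,
not measurements from our deployment.

.. code-block:: shell

  # Illustrative numbers only; substitute the real values.
  dcache_entries=50000   # files each client may cache (its -dcache value)
  clients=40             # number of AFS clients reading from the server

  callbacks=$((dcache_entries * clients))
  echo "-cb $callbacks"  # prints: -cb 2000000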

Updating Settings
~~~~~~~~~~~~~~~~~

The helper script :git_file:`tools/afs-server-restart.sh` restarts AFS
servers and can optionally enable audit logging on them, which is
sometimes useful for debugging AFS clients.  You can edit settings in
the script and run ``afs-server-restart.sh restart`` (or
``restart-auditing``).

If you wish to update the settings for an existing server manually,
you can stop and remove the existing ``bnode`` (the collection of
processes the overseer is monitoring, created via ``bos create``
above) and recreate it.

For example ::

  bos stop -server afs01.dfw.openstack.org \
           -instance dafs \
           -wait

Then remove the bnode with ::

  bos delete -server afs01.dfw.openstack.org \
             -instance dafs

Finally run the ``bos create`` command above with any modified
parameters to restart the server.

Recovering a Failed Fileserver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a fileserver crashes, take the following steps to ensure it's
usable after recovery:

* Pause mirror updates and volume release cron jobs

* Reboot the server; fix any filesystem errors and check the salvager
  logs

* Check for any stuck volume transactions; remedy as appropriate

* Perform a manual release of every volume from a terminal on a server
  using "-localauth" in case OpenAFS decides it can't do an
  incremental update.

* Re-enable cron jobs
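
The manual release step can be prepared by turning ``vos listvldb``
output into a list of commands to review.  This is a sketch, assuming
the output format shown elsewhere in this document (a volume name on
its own unindented line, followed by indented site lines);
``release_commands`` is a hypothetical helper, and it only prints
commands rather than running them.

.. code-block:: shell

  # Hypothetical helper: read `vos listvldb` output on stdin and print
  # one `vos release` command for each volume that has an RO site.
  release_commands() {
    awk '/^[^ \t]/ && NF == 1 { name = $1; next }
         /RO Site/ && name != "" {
           print "vos release " name " -localauth"
           name = ""   # one command per volume, however many RO sites
         }'
  }

  # Intended use, on an AFS server:
  #   vos listvldb -localauth | release_commands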

Mirrors
~~~~~~~

We host mirrors in AFS so that we store only one copy of the data, but
mirror servers local to each cloud region in which we operate serve
that data to nearby hosts from their local cache.

All of our mirrors are housed under ``/afs/openstack.org/mirror``.
Each mirror is on its own volume, and each volume has a read-only
replica.  This allows a mirror to be updated and then its read-only
replicas to be updated atomically.  Because mirrors are typically very
large and replication across regions is slow, we place both copies of
mirror data on two fileservers in the same region.  This allows us to
perform maintenance on fileservers hosting mirror data as well as deal
with outages related to a single server, but does not protect the
mirror system from a region-wide outage.

In order to establish a new mirror, do the following:

* The following commands need to be run authenticated on a host with
  kerberos and AFS set up (see `afs_client`_; admins can run the
  commands on ``mirror-update.opendev.org``).  See the note above
  about *not* doing this on the actual fileservers.  First run
  ``kinit`` and ``aklog`` to get tokens.

* Create the mirror volume.  See `Creating a Volume`_ for details.
  The volume should be named ``mirror.foo``, where ``foo`` is
  descriptive of the contents of the mirror.  Example::

    vos create afs01.dfw.openstack.org a mirror.foo

* Create read-only replicas of the volume.  One replica should be
  located on the same fileserver (it will take little to no additional
  space), and at least one other replica on a different fileserver.
  Example::

    vos addsite afs01.dfw.openstack.org a mirror.foo
    vos addsite afs02.dfw.openstack.org a mirror.foo

* Release the read-only replicas::

    vos release mirror.foo

  See the status of all volumes with::

    vos listvldb

When traversing from a read-only volume to another volume across a
mountpoint, AFS will first attempt to use a read-only replica of the
destination volume if one exists.  In order to naturally cause clients
to prefer our read-only paths for mirrors, the entire path up to that
point is composed of read-only volumes::

  /afs             [root.afs]
    /openstack.org [root.cell]
      /mirror      [mirror]
        /bar       [mirror.bar]

In order to mount the ``mirror.foo`` volume under ``mirror`` we need
to modify the read-write version of the ``mirror`` volume.  To make
this easy, the read-write version of the cell root is mounted at
``/afs/.openstack.org``.  Following the same logic from earlier,
traversing to paths below that mount point will generally prefer
read-write volumes.
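
The relationship between the two paths is purely a matter of the
leading dot on the cell name.  As a small illustration, the hypothetical
helper ``rw_path`` below rewrites a read-only path into its read-write
counterpart and leaves paths that are already read-write untouched.

.. code-block:: shell

  # Hypothetical helper: convert a read-only cell path to the
  # read-write path by adding the dot in front of the cell name.
  rw_path() {
    printf '%s\n' "$1" | sed 's|^/afs/openstack\.org|/afs/.openstack.org|'
  }

  rw_path /afs/openstack.org/mirror/foo   # prints /afs/.openstack.org/mirror/foo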

* Mount the volume into afs using the read-write path::

    fs mkmount /afs/.openstack.org/mirror/foo mirror.foo

* Release the ``mirror`` volume so that the (currently empty) foo
  mirror itself appears in directory listings under
  ``/afs/openstack.org/mirror``::

    vos release mirror

* Create a principal for the mirror update process.  See
  :ref:`addprinc` for details.  The principal should be called
  ``service/foo-mirror``.  Example::

    kadmin: addprinc -randkey service/foo-mirror@OPENSTACK.ORG
    kadmin: ktadd -k /path/to/foo.keytab service/foo-mirror@OPENSTACK.ORG

  .. warning:: Each time ``ktadd`` is run, the key is rotated and
               previous keytabs are invalidated.

* Add the service principal's keytab to Ansible secrets.  Copy the
  binary key to ``bridge.openstack.org`` and then use ``hieraedit`` to
  update the files:

  .. code-block:: console

    root@bridge:~# /home/zuul/src/opendev.org/opendev/system-config/tools/hieraedit.py \
      --yaml /etc/ansible/hosts/host_vars/mirror-update01.opendev.org.yaml \
      -f /path/to/foo.keytab KEYNAME

  (don't forget to ``git commit`` and save the change; you can remove
  the copies of the binary key too).  The key will be base64 encoded
  in the hiera database.  If you need to examine it for some reason
  you can use ``base64``::

    cat /path/to/foo.keytab | base64

* Ensure the values in this new variable are written to disk as the
  keytab on ``mirror-update.opendev.org`` by adding it to the
  ``mirror-update`` role for the mirror scripts to use during update.
  You should check this with ``testinfra`` in
  ``testinfra/test_mirror-update.py`` (note this involves defining a
  "dummy" keytab for testing; see the other examples).

* Create an AFS user for the service principal::

    pts createuser service.foo-mirror

Because mirrors usually have a large number of directories, it is best
to avoid frequent ACL changes.  To this end, we grant access to the
mirror directories to a group where we can easily modify group
membership if our needs change.

* Create a group to contain the service principal, and add the
  principal::

    pts creategroup foo-mirror
    pts adduser service.foo-mirror foo-mirror

  View users, groups, and their membership with::

    pts listentries
    pts listentries -group
    pts membership foo-mirror

* Grant the group access to the mirror volume::

    fs setacl /afs/.openstack.org/mirror/foo foo-mirror write

* Grant anonymous users read access::

    fs setacl /afs/.openstack.org/mirror/foo system:anyuser read

* Set the quota on the volume (e.g., 100GB)::

    fs setquota /afs/.openstack.org/mirror/foo 100000000
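
The quota argument is given in kilobytes.  The conversion used in the
100GB example can be sketched as follows; ``gb_to_kb`` is a
hypothetical helper that treats 1GB as 1000 * 1000 kB, which is the
rounding the example above uses.

.. code-block:: shell

  # Hypothetical helper: convert gigabytes to the kilobyte count
  # passed to `fs setquota`.
  gb_to_kb() {
    echo $(( $1 * 1000 * 1000 ))
  }

  gb_to_kb 100   # prints 100000000

  # Intended use:
  #   fs setquota /afs/.openstack.org/mirror/foo "$(gb_to_kb 100)"

Afterwards, ``fs listquota`` on the same path shows the quota that is
in effect.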

Because the initial replication may take more time than we allocate in
our mirror update cron jobs, manually perform the first mirror update:

* In screen, obtain the lock on ``mirror-update01.opendev.org``::

    flock -n /var/run/foo-mirror/mirror.lock bash

  Leave that running while you perform the rest of the steps.

* Also in screen on ``mirror-update``, run the initial mirror sync.
  If using one of the mirror update scripts (from ``/usr/local/bin``)
  be aware that they generally run the update process under
  ``timeout`` with shorter periods than may be required for the
  initial full sync.  For example, for ``reprepro`` mirrors::

    /usr/local/bin/reprepro-mirror-update /etc/reprepro/ubuntu mirror.ubuntu

* Log into ``afs01.dfw.openstack.org`` and run ``screen``.  Within
  that session, periodically during the sync, and once again after it
  is complete, run::

    vos release mirror.foo -localauth

  It is important to do this from an AFS server using ``-localauth``
  rather than your own credentials and inside of screen because if
  ``vos release`` is interrupted, it will require some manual cleanup
  (data will not be corrupted, but clients will not see the new volume
  until it is successfully released).  Additionally, ``vos release`` has
  a bug where it will not use renewed tokens and so token expiration
  during a vos release may cause a similar problem.

* Once the initial sync and ``vos release`` are complete, release
  the lock file on mirror-update.
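
Holding the lock keeps scheduled updates away for the duration of the
manual sync.  The sketch below demonstrates the mechanism with a
throwaway lock file, assuming the scheduled job takes the same lock with
``flock -n``; it does not touch any real mirror lock.

.. code-block:: shell

  # Throwaway lock file for the demonstration.
  lock=$(mktemp)

  # Stand-in for the screen session: hold the lock for a few seconds.
  flock -n "$lock" sleep 3 &
  sleep 1

  # Stand-in for the scheduled job: a non-blocking attempt fails
  # while the lock is held.
  if flock -n "$lock" true; then
    state=free
  else
    state=busy
  fi
  echo "lock is $state"

  wait            # the holder exits, which releases the lock
  rm -f "$lock"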

Removing a mirror
~~~~~~~~~~~~~~~~~

If you need to remove a mirror, you can do the following:

* Unmount the volume from the R/W location::

    fs rmmount /afs/.openstack.org/mirror/foo

* Release the R/O mirror volume to reflect the changes::

    vos release mirror

* Check what servers the volumes are on with ``vos listvldb``::

    VLDB entries for all servers
    ...

    mirror.foo
        RWrite: 536870934     ROnly: 536870935
        number of sites -> 3
           server afs01.dfw.openstack.org partition /vicepa RW Site
           server afs01.dfw.openstack.org partition /vicepa RO Site
           server afs01.ord.openstack.org partition /vicepa RO Site
     ...

* Remove the R/O replicas (you can also see these with ``vos
  listvol -server afs0[1|2].dfw.openstack.org``)::

    vos remove -server afs01.dfw.openstack.org -partition a -id mirror.foo.readonly
    vos remove -server afs02.dfw.openstack.org -partition a -id mirror.foo.readonly

* Remove the R/W volume from the server that ``vos listvldb`` reports
  as the RW site::

    vos remove -server afs01.dfw.openstack.org -partition a -id mirror.foo
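
The site lines in the ``vos listvldb`` output shown above can be turned
into removal commands so that server names are not typed by hand.  This
is a sketch only; ``ro_remove_commands`` is a hypothetical helper that
prints commands for review and runs nothing.

.. code-block:: shell

  # Hypothetical helper: read a VLDB entry on stdin and print one
  # `vos remove` command per read-only site.  The volume name is
  # passed as the first argument.
  ro_remove_commands() {
    awk -v vol="$1" '$1 == "server" && $3 == "partition" && $5 == "RO" {
      print "vos remove -server " $2 " -partition " $4 " -id " vol ".readonly"
    }'
  }

  # Intended use:
  #   vos listvldb -name mirror.foo | ro_remove_commands mirror.foo

The partition is printed in the full ``/vicepa`` form, which ``vos``
accepts as well as the short ``a``.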

Reverse Proxy Cache
^^^^^^^^^^^^^^^^^^^

* `modules/openstack_project/templates/mirror.vhost.erb
  <https://opendev.org/opendev/system-config/src/branch/master/modules/openstack_project/templates/mirror.vhost.erb>`__

Each of the region-local mirror hosts exposes a limited reverse HTTP
proxy on port 8080.  These proxies run within the same Apache setup as
used to expose AFS mirror contents.  `mod_proxy
<https://httpd.apache.org/docs/2.4/mod/mod_proxy.html>`__ is used to
expose a white-listed set of resources (currently just RDO).

Currently they will cache data for up to 24 hours (Apache default)
with pruning performed by ``htcacheclean`` once an hour to keep the
cache size at or under 2GB of disk space.

The reverse proxy is provided because there are some hosted resources
that are not currently able to be practically mirrored.  Examples of
this include RDO (rsync from RDO is slow and they update frequently)
and docker images (which require specialized software to run a docker
registry and then sorting out how to run that on a shared filesystem).

Apache was chosen because we already had configuration management in
place for Apache on these hosts.  This avoids management overheads of
a completely new service deployment such as Squid or a caching docker
registry daemon.

No Outage Server Maintenance
----------------------------

afsdb0X.openstack.org
~~~~~~~~~~~~~~~~~~~~~

We have redundant AFS DB servers. You can take one down without causing
a service outage as long as the others remain up. To do this safely::

  root@afsdb01:~# bos shutdown afsdb01.openstack.org -wait -localauth
  root@afsdb01:~# bos status afsdb01.openstack.org -localauth
  Instance ptserver, temporarily disabled, currently shutdown.
  Instance vlserver, temporarily disabled, currently shutdown.

Then perform your maintenance on afsdb01. When done, a reboot will
automatically restart the bos service or you can manually restart
the openafs-fileserver service::

  root@afsdb01:~# service openafs-fileserver start

Finally check that the service is back up and running::

  root@afsdb01:~# bos status afsdb01.openstack.org -localauth
  Instance ptserver, currently running normally.
  Instance vlserver, currently running normally.

Now you can repeat the process against afsdb02 or afsdb03.
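
Before moving on to the next server, the status check can be automated.
The following sketch assumes the ``bos status`` output format shown
above; ``all_instances_running`` is a hypothetical helper.

.. code-block:: shell

  # Hypothetical helper: read `bos status` output on stdin and succeed
  # only if at least one instance is listed and every instance reports
  # "currently running normally".
  all_instances_running() {
    awk '/^Instance / { n++; if ($0 !~ /currently running normally/) bad++ }
         END { exit (n == 0 || bad > 0) }'
  }

  # Intended use, waiting until afsdb01 is healthy again:
  #   until bos status afsdb01.openstack.org -localauth | all_instances_running
  #   do sleep 5; done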

afs0X.openstack.org
~~~~~~~~~~~~~~~~~~~

Taking down the actual fileservers is slightly more complicated
but works similarly. Basically what we need to do is make sure that
either no one needs the RW volumes hosted by a fileserver before
taking it down or move the RW volume to another fileserver. When
taking down afs01.dfw.openstack.org we must also ensure that the
vos releases that are performed on it by mirror-update are stopped.

To ensure nothing needs the RW volumes you can hold the various
file locks on hosts that publish to AFS and/or remove cron entries
that perform vos releases or volume writes.

If instead you need to move the RW volume, the first step is to check
where the volumes live::

  root@afsdb01:~# vos listvldb -localauth
  VLDB entries for all servers

  mirror
      RWrite: 536870934     ROnly: 536870935
      number of sites -> 3
         server afs01.dfw.openstack.org partition /vicepa RW Site
         server afs01.dfw.openstack.org partition /vicepa RO Site
         server afs01.ord.openstack.org partition /vicepa RO Site

We see that if we want to allow writes to the mirror volume and take
down afs01.dfw.openstack.org we will have to move the volume to one
of the other servers::

  root@afsdb01:~# screen # use screen as this may take quite some time.
  root@afsdb01:~# vos move -id mirror -toserver afs01.ord.openstack.org -topartition vicepa -fromserver afs01.dfw.openstack.org -frompartition vicepa -localauth

When that is done (use listvldb command above to check) it is now safe
to take down afs0X.dfw.openstack.org while having writers to the mirror
volume. If operating on afs01.dfw.openstack.org you should also hold
all mirror update locks and the release-volumes lock. This ensures
we do not interrupt any vos releases on afs01.dfw.openstack.org that are
run by mirror-update remotely. We use the same process as for the db server::

  root@afsdb01:~# bos shutdown afs01.dfw.openstack.org -localauth
  root@afsdb01:~# bos status afs01.dfw.openstack.org -localauth
  Auxiliary status is: file server shut down.

Perform maintenance, then restart as above and check the status again::

  root@afsdb01:~# bos status afs01.dfw.openstack.org -localauth
  Auxiliary status is: file server running.

DNS Entries
-----------

AFS uses the following DNS entries which indicate an even balance::

  _afs3-prserver._udp.openstack.org. 300 IN SRV 10 10 7002 afsdb01.openstack.org.
  _afs3-prserver._udp.openstack.org. 300 IN SRV 10 10 7002 afsdb02.openstack.org.
  _afs3-prserver._udp.openstack.org. 300 IN SRV 10 10 7002 afsdb03.openstack.org.
  _afs3-vlserver._udp.openstack.org. 300 IN SRV 10 10 7003 afsdb01.openstack.org.
  _afs3-vlserver._udp.openstack.org. 300 IN SRV 10 10 7003 afsdb02.openstack.org.
  _afs3-vlserver._udp.openstack.org. 300 IN SRV 10 10 7003 afsdb03.openstack.org.

Be sure to update them if volume location and PTS servers change. Also note
that only A (IPv4 address) records are used in the SRV data. Since OpenAFS
lacks support for IPv6, avoid entering corresponding AAAA (IPv6 address)
records for these so that it won't cause fallback delays for other
v6-supporting AFS client implementations.
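
After changing the records, the even balance can be checked
mechanically.  This sketch assumes zone-file style lines like those
listed above; ``srv_balanced`` is a hypothetical helper.

.. code-block:: shell

  # Hypothetical helper: read SRV records on stdin and succeed only if
  # all records for the same service name share one priority and weight.
  srv_balanced() {
    awk '$4 == "SRV" {
           key = $5 " " $6
           if (($1 in seen) && (seen[$1] != key)) bad = 1
           seen[$1] = key
         }
         END { exit (bad + 0) }'
  }

  # Intended use:
  #   dig +noall +answer SRV _afs3-vlserver._udp.openstack.org | srv_balanced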