
..
      Copyright (c) 2017 Rackspace

      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

======================================
Operator Maintenance Guide
======================================
This document is intended for operators. For a developer guide see the
:doc:`dev-quick-start` in this documentation repository. For an end-user
guide, please see the :doc:`basic-cookbook` in this documentation
repository.

Monitoring
==========


Monitoring Load Balancer Amphora
--------------------------------
Octavia will monitor the load balancing amphorae itself and initiate failovers
and/or replacements if they malfunction. Therefore, most installations won't
need to monitor the amphorae running the load balancer.

Octavia logs each failover to the corresponding health manager logs. It is
advisable to use log analytics to monitor failover trends so that problems in
the OpenStack installation are noticed early. We have seen neutron (network)
connectivity issues, Denial of Service attacks, and nova (compute)
malfunctions lead to a higher than normal failover rate. In those cases the
monitoring of the other services showed the problems as well, so depending on
your overall monitoring strategy this might be optional.
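
A failover trend can be approximated by counting matching log lines per hour.
The following is a minimal sketch, not an Octavia tool: the function name
``count_per_hour``, the log path, and the search pattern are assumptions, and
it presumes that each log line begins with a ``YYYY-MM-DD HH:MM:SS``
timestamp. Adjust all of these to your deployment.

.. code-block:: bash

    # Count log lines matching PATTERN (case-insensitive), grouped by hour.
    # Assumes each line starts with "YYYY-MM-DD HH:MM:SS".
    count_per_hour() {
        local pattern="$1" logfile="$2"
        grep -i -- "$pattern" "$logfile" \
            | awk '{ print $1, substr($2, 1, 2) ":00" }' \
            | sort | uniq -c
    }

    # Example (hypothetical path and pattern):
    # count_per_hour 'failover' /var/log/octavia/health-manager.log

Each output line holds a count followed by the date and hour, which is easy
to feed into an alerting threshold.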

If additional monitoring is necessary, review the corresponding calls on
the amphora agent REST interface (see :doc:`../api/haproxy-amphora-api`).

Monitoring Pool Members
-----------------------

Octavia uses the health information from the underlying load balancing
application to determine the health of members. This information is
streamed to the Octavia database and made available via the status
tree or other API methods. For critical applications we recommend
polling this information at regular intervals.
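
As one possible way to poll member health, the sketch below filters a member
listing for anything that is not ``ONLINE``. The helper name
``unhealthy_members`` is hypothetical, and the example invocation assumes the
``openstack loadbalancer`` commands from python-octaviaclient are installed.
Members without a health monitor report a status other than ``ONLINE`` and
would be listed too, so adapt the filter as needed.

.. code-block:: bash

    # Print the IDs of members whose operating status is not ONLINE.
    # Reads "<member-id> <operating_status>" pairs, one per line, on stdin.
    unhealthy_members() {
        awk 'NF >= 2 && $2 != "ONLINE" { print $1 }'
    }

    # Example (set POOL_ID to a real pool ID first):
    # openstack loadbalancer member list "$POOL_ID" -f value \
    #     -c id -c operating_status | unhealthy_members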

Monitoring Load Balancer Functionality
--------------------------------------

For production sites we recommend using outside monitoring services. They
use servers distributed around the globe to monitor not only whether the site
is up but also parts of the system outside the visibility of Octavia, such as
routers and network connectivity.

.. _Monasca Octavia plugin: https://github.com/openstack/monasca-agent/blob/master/monasca_setup/detection/plugins/octavia.py

Monitoring Octavia Control Plane
--------------------------------

To monitor the Octavia control plane we recommend process monitoring of the
main Octavia processes:

* octavia-api

* octavia-worker

* octavia-health-manager

* octavia-housekeeping

The Monasca project has a plugin for such monitoring (see
`Monasca Octavia plugin`_).
Please refer to this project for further information.
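
Where no monitoring framework is available, a basic process check can be
scripted. This is only a sketch: the helper names ``is_running`` and
``report_missing`` are hypothetical, and it assumes that each service runs as
a standalone process whose command line contains the service name (which is
not the case if, for example, octavia-api is served through a web server).

.. code-block:: bash

    # Return success if a process whose command line matches NAME is running.
    is_running() {
        pgrep -f -- "$1" >/dev/null 2>&1
    }

    # Print every expected Octavia process that is not running.
    # The exit status is non-zero if at least one is missing.
    report_missing() {
        local name missing=0
        for name in octavia-api octavia-worker \
                    octavia-health-manager octavia-housekeeping; do
            if ! is_running "$name"; then
                echo "MISSING: $name"
                missing=1
            fi
        done
        return "$missing"
    }

    # Example: report_missing || echo "Octavia control plane is degraded"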

Octavia's control plane components are shared-nothing and can be scaled
linearly. For high availability of the control plane we recommend running at
least one set of components in each availability zone. Furthermore, the
octavia-api endpoint could be behind a load balancer or other HA technology.
That said, if one or more components fail the system will still be available
(though potentially degraded). For instance, if you have installed one set of
components in each of three availability zones, Octavia will still be
responsive and available even if you lose a whole zone. Only if you lose the
Octavia control plane in all three zones will the service be unavailable.
Please note this only addresses control plane availability; the availability
of the load balancing function depends highly on the chosen topology and the
anti-affinity settings. See our forthcoming HA guide for more details.

Additionally, we recommend monitoring the Octavia API endpoint(s). There
is currently no special URL to use, so polling the root URL at regular
intervals is sufficient.
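
Polling the root URL could look like the following sketch. The function name
``api_is_up`` and the example host are hypothetical; the check simply treats
any non-error HTTP response within the timeout as "up".

.. code-block:: bash

    # Return success if URL answers with a non-error HTTP status in time.
    # Usage: api_is_up URL [TIMEOUT_SECONDS]
    api_is_up() {
        curl --fail --silent --output /dev/null --max-time "${2:-5}" "$1"
    }

    # Example (hypothetical endpoint):
    # api_is_up http://octavia.example.com:9876/ || echo "Octavia API is down"

Run it from cron or a monitoring agent at the interval that suits your
alerting requirements.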

There is a host of information in the log files which can be used for log
analytics. A few examples of what could be monitored are:

* Amphora Build Rate - to determine load of the system

* Amphora Build Time - to determine how long it takes to build an amphora

* Failures/Errors - to be notified of system problems early

.. _rotating_amphora:

Rotating the Amphora Images
===========================

Octavia starts load balancers from a pre-built image that contains the
amphora agent and a load balancing application, and is seeded with
cryptographic certificates through the config drive at startup.

Rotating the image means making a load balancer amphora running with an old
image fail over to an amphora with a new image. This should happen without
any measurable interruption in the load balancing functionality when using
the ACTIVE/STANDBY topology. Standalone load balancers might experience a
short outage.

Here are some reasons you might need to rotate the amphora image:

* There has been a (security) update to the underlying operating system

* You want to deploy a new version of the amphora agent or haproxy

* The cryptographic certificates and/or keys on the amphora have been
  compromised.

* Though not related to rotating images, this procedure might be invoked if you
  are switching to a different flavor for the underlying virtual machine.

Preparing a New Amphora Image
-----------------------------

To prepare a new amphora image you will need to use diskimage-create.sh as
described in the README in the diskimage-create directory.

For instance, in the ``octavia/diskimage-create`` directory, run:

   .. code-block:: bash

     ./diskimage-create.sh

Once you have created a new image you will need to upload it into glance. The
following shows how to do this if you have set the image tag in the
Octavia configuration file. Make sure to use a user with the same tenant as
the Octavia service account:

 .. code-block:: bash

      openstack image create --file /var/lib/octavia/amphora-x64-haproxy.qcow2 \
      --disk-format qcow2 --tag <amphora-image-tag> --private \
      --container-format bare amphora-x64-haproxy

If you didn't configure image tags and instead configured an image id, you
will need to update the Octavia configuration file with the new id and restart
the Octavia services (except octavia-api).

Generating a List of Amphorae to Rotate
---------------------------------------

The easiest way to generate a list is to use nova to list all the amphorae:

 .. code-block:: bash

        openstack server list --name 'amphora*' --all -c ID -c Status -c Networks

Take note of the amphora IDs and IPs on the management network.

If you are using an ACTIVE-STANDBY topology it might be beneficial to rotate
the BACKUP amphora before the ACTIVE one. In this case you will need
to use the Octavia database to query for the roles:

    .. code-block:: bash

        mysql octavia -e 'select id, compute_id, lb_network_ip, role from amphora where status="ALLOCATED" or status="READY";'

Take note of the compute IDs, the role (ACTIVE, BACKUP), and the IP on the
management network. You can also find this IP either via
``openstack server show <id>`` or
``openstack server list --all -c ID -c Status -c Networks | grep <id>``.
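
To work through the BACKUP amphorae first, the query result can be reordered
before it is processed. The helper ``backup_first`` below is a hypothetical
sketch that assumes the four columns of the query above, in that order, with
one amphora per line.

.. code-block:: bash

    # Reorder rows so that BACKUP amphorae come first. Expects
    # whitespace-separated "id compute_id lb_network_ip role" rows on stdin.
    backup_first() {
        awk '$4 == "BACKUP" { print; next }
             { rest[n++] = $0 }
             END { for (i = 0; i < n; i++) print rest[i] }'
    }

    # Example (-N drops the header row, -B gives tab-separated output):
    # mysql -N -B octavia -e 'select id, compute_id, lb_network_ip, role
    #     from amphora where status="ALLOCATED";' | backup_first

The relative order of rows within each group is preserved.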

Rotate an Amphora
-----------------

The idea is to force Octavia to start up a new amphora because Octavia thinks
this one has failed. The most graceful way to do this is to shut down the port
on the management network.

Use the IP on the management network to find the port-id:

    .. code-block:: bash

        openstack port list | grep <ip on mgmt net>

Take note of the port-id and initiate a failover by shutting down this port:

     .. code-block:: bash

        openstack port set --disable <port-id>

You can observe the failover by querying nova with
``openstack server list --all | grep <id>`` until the server is no longer
found.
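
Waiting for the old server to disappear can be automated with a small polling
loop. The function ``wait_until_fails`` is a hypothetical helper, not part of
any OpenStack client: it reruns a command until the command fails or a
timeout expires.

.. code-block:: bash

    # Rerun a command until it fails (success) or TIMEOUT seconds pass
    # (failure). Usage: wait_until_fails TIMEOUT INTERVAL COMMAND [ARGS...]
    wait_until_fails() {
        local timeout="$1" interval="$2" deadline
        shift 2
        deadline=$(( $(date +%s) + timeout ))
        while "$@" >/dev/null 2>&1; do
            if [ "$(date +%s)" -ge "$deadline" ]; then
                return 1
            fi
            sleep "$interval"
        done
        return 0
    }

    # Example: poll every 10 seconds for up to 10 minutes until nova no
    # longer knows the old amphora (COMPUTE_ID is the ID noted earlier):
    # wait_until_fails 600 10 openstack server show "$COMPUTE_ID"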

.. _best_practice:

Best Practices/Optimizations
----------------------------

To speed up the failovers, the spare pool can be temporarily increased to
accommodate the rapid failover of the amphorae. In this case, after the
new image has been loaded into glance, shut down or initiate a failover of the
amphorae in the spare pool. They can be found, for instance, by looking for
the servers in ``openstack server list --all`` that only have an IP on the
management network assigned but none on a tenant network. Alternatively, use
this database query:


    .. code-block:: bash

        mysql octavia -e 'select id, compute_id, lb_network_ip from amphora where status="READY";'


After you have increased the spare pool size and restarted all Octavia
services, failovers will be greatly accelerated. To preserve resources,
restore the old settings and restart the Octavia services. Since Octavia won't
terminate superfluous spare amphorae on its own, they can be left in the
system and will automatically be used up as new load balancers are created
and/or load balancers in error state are failed over.

.. warning::
    If you are using the anti-affinity feature please be aware that it is
    not compatible with spare pools and you are risking both the ACTIVE and
    BACKUP amphora being scheduled on the same host. It is recommended to
    not increase the spare pool during failovers in this case (and not to use
    the spare pool at all).

Since a failover puts significant load on the OpenStack installation by
creating new virtual machines and ports, it should either be done at a very
slow pace, during a time with little load, or with the right throttling
enabled in Octavia. The throttling prioritizes failovers higher than other
operations, so depending on how many failovers are initiated this might
crowd out other operations.
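
Pacing the failovers by hand can be as simple as pausing between them. The
sketch below uses a hypothetical helper, ``run_paced``, that runs a given
command once per input line and sleeps in between. The pause length and the
file name in the example are assumptions to be tuned to your environment.

.. code-block:: bash

    # Run COMMAND once for each non-empty line on stdin, appending the line
    # as the last argument, and sleep PAUSE seconds after each run.
    # Usage: run_paced PAUSE COMMAND [ARGS...]
    run_paced() {
        local pause="$1" item
        shift
        while IFS= read -r item; do
            [ -n "$item" ] || continue
            # Keep the command from consuming the remaining input lines.
            "$@" "$item" </dev/null
            sleep "$pause"
        done
    }

    # Example: disable one management port every 5 minutes, reading the
    # port IDs from a file prepared beforehand (hypothetical file name):
    # run_paced 300 openstack port set --disable < port-ids.txt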

.. note::
    In Pike a failover command is being added to the API. It allows failing
    over a load balancer's amphorae while taking care of the intricacies of
    different topologies, and it prioritizes administrative failovers behind
    other operations. This function should be used instead of the ones
    described above once it becomes available.

Rotating Cryptographic Certificates
===================================

Octavia secures the communication between the amphora agent and the control
plane with two-way SSL encryption. To accomplish that, several certificates
are distributed in the system:

* Control plane:

  * Amphora certificate authority (CA) certificate: Used to validate
    amphora certificates if Octavia acts as a Certificate Authority to
    issue new amphora certificates

  * Client certificate: Used to authenticate with the amphora

* Amphora:

  * Client CA certificate: Used to validate control plane
    client certificate

  * Amphora certificate: Presented to control plane processes to prove amphora
    identity.

The heartbeat UDP packets emitted from the amphora are secured with a
symmetric encryption key. This is set by the configuration option
``heartbeat_key`` in the ``health_manager`` section. We recommend setting it
to a random string of a sufficient length.
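
One way to produce such a random string is to read from the kernel's random
source. This is a sketch under the assumption that ``/dev/urandom`` is
available; the function name ``generate_key`` and the default of 32 bytes
(64 hexadecimal characters) are illustrative choices, not Octavia
requirements.

.. code-block:: bash

    # Print a random key of NBYTES bytes (default 32) as hexadecimal text.
    generate_key() {
        head -c "${1:-32}" /dev/urandom | od -An -tx1 | tr -d ' \n'
        echo
    }

    # Example: generate_key 32

The resulting string can then be used as the value of ``heartbeat_key``.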

.. _rotate-amphora-certs:

Rotating Amphora Certificates
-----------------------------

For the server part Octavia will either act as a certificate authority itself,
or use :doc:`../main/Anchor` to issue amphora certificates to be used
by each amphora. Octavia will also monitor those certificates and refresh them
before they expire.

There are three ways to initiate a rotation manually:

* Change the expiration date of the certificate in the database. Octavia
  will then rotate the amphora certificates with newly issued ones. This
  requires the following:

  * Client CA certificate hasn't expired or the
    corresponding client certificate on the control plane hasn't been issued by
    a different client CA (in case the authority was
    compromised)

  * The Amphora CA certificate on the control plane didn't
    change in any way which jeopardizes validation of the amphora certificate
    (e.g. the certificate was reissued with a new private/public key)

* If the amphora CA changed in a way which jeopardizes
  validation of the amphora certificate an operator can manually upload newly
  issued amphora certificates by switching off validation of the old amphora
  certificate. This requires a client certificate which can be validated by the
  client CA file on the amphora. Refer to
  :doc:`../api/haproxy-amphora-api` for more details.

* If the client certificate on the control plane changed in a way that it can't
  be validated by the client certificate authority certificate on the amphora,
  a failover (see :ref:`rotating_amphora`) of all amphorae needs to be
  initiated. Until the failover is completed the amphorae can't be controlled
  by the control plane.

Rotating the Certificate Authority Certificates
-----------------------------------------------

If there is a compromise of the certificate authorities' certificates, or they
have expired, new ones need to be installed into the system. If Octavia is
not acting as the certificate authority, only the certificate authority's
cert needs to be changed in the system so the amphorae can be authenticated
again.

#. Issue new certificates (see the script in the bin folder of Octavia if
   Octavia is acting as the certificate authority) or follow the instructions
   of the third-party certificate authority. Copy the certificate and the
   private key (if Octavia acts as a certificate authority) where Octavia can
   find them.

#. If the previous certificate files haven't been overwritten, adjust the paths
   to the new certs in the configuration file and restart all Octavia services
   (except octavia-api).

#. Review :ref:`rotate-amphora-certs` above to determine if and how the
   amphora certificates need to be rotated.

Rotating Client Certificates
----------------------------

If the client certificates have expired, new ones need to be issued and
installed on the system:

#. Issue a new client certificate (see the script in the bin folder of Octavia
   if self-signed certificates are used) or use the ones provided to you by
   your certificate authority.

#. Copy the new cert where Octavia can find it.

#. If the previous certificate files haven't been overwritten, adjust the paths
   to the new certs in the configuration file. In all cases restart all Octavia
   services except octavia-api.

If the client CA certificate has been replaced in addition to rotating the
client certificate, the new client CA certificate needs to be installed in the
system. After that, initiate a failover of all amphorae to distribute the new
client CA cert. Until the failover is completed the amphorae can't be
controlled by the control plane.
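
Because expired client or CA certificates trigger the procedures above, it
can help to check the remaining validity ahead of time. The following sketch
relies on the ``-checkend`` option of ``openssl x509``. The function name
``cert_valid_for`` and the certificate path in the example are hypothetical.

.. code-block:: bash

    # Return success if the PEM certificate in FILE is still valid DAYS days
    # from now. A missing or unreadable file counts as failure.
    # Usage: cert_valid_for DAYS FILE
    cert_valid_for() {
        openssl x509 -noout -checkend "$(( $1 * 86400 ))" -in "$2" \
            >/dev/null 2>&1
    }

    # Example (hypothetical path):
    # cert_valid_for 30 /etc/octavia/certs/client.pem \
    #     || echo "client certificate expires within 30 days"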

Changing The Heartbeat Encryption Key
-------------------------------------

Special caution needs to be taken when replacing the heartbeat encryption key.
Once it is changed, Octavia can't read any heartbeats, will assume that
all amphorae are in an error state, and will initiate an immediate failover.

In preparation, read the chapter on :ref:`best_practice` in
the Failover section. In particular, if the throttling enhancement (available
in Pike) is not present, it is advisable to create a sufficient number of
spare amphorae to mitigate the stress on the OpenStack installation when
Octavia starts to replace all amphorae immediately.

Given the risks involved with changing this key it should not be changed
during routine maintenance but only when a compromise is strongly suspected.

.. note::
   For future versions of Octavia an "update amphora" API is planned which
   will allow this key to be changed without failover. At that time there would
   be a procedure to halt health monitoring while the keys are rotated and then
   resume health monitoring.