Merge "Failover stop threshold / circuit breaker"
commit 93a2b0a0c3

doc/source/admin/failover-circuit-breaker.rst (new file, 131 lines)

@@ -0,0 +1,131 @@
..
      Copyright Red Hat

      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

========================================
Octavia Amphora Failover Circuit Breaker
========================================
During a large infrastructure outage, the automatic failover of stale
amphorae can lead to a mass failover event and create a considerable
amount of extra load on servers. The amphora failover circuit breaker
feature helps you avoid these unwanted mass failover events. The
circuit breaker is a configurable threshold: once the number of stale
amphorae reaches it, Octavia stops failing amphorae over automatically.
The circuit breaker feature is disabled by default.
Configuration
=============

You define the threshold value for the failover circuit breaker feature
by setting the *failover_threshold* option. The *failover_threshold*
option is a member of the *health_manager* group within the
configuration file ``/etc/octavia/octavia.conf``.
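For example, a deployment might enable the circuit breaker like this (the threshold of ``20`` is purely illustrative; size it for your environment):

```ini
[health_manager]
# Circuit breaker: stop automatic failovers once this many amphorae
# are simultaneously stale. The option is unset (disabled) by default.
failover_threshold = 20
```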
Whenever the number of stale amphorae reaches or surpasses the value
of *failover_threshold*, Octavia performs the following actions:

* stops automatic failovers of amphorae.
* sets the status of the stale amphorae to *FAILOVER_STOPPED*.
* logs an error message.

The line below shows a typical error message:

.. code-block:: bash

   ERROR octavia.db.repositories [-] Stale amphora count reached the threshold (3). 4 amphorae were set into FAILOVER_STOPPED status.
.. note:: Base the value that you set for *failover_threshold* on the
   size of your environment. We recommend that you set the value to a
   number greater than the typical number of amphorae that you expect
   to run on a single host, or to a value that reflects 20% to 30%
   of the total number of amphorae.
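As a purely numerical illustration of the sizing guidance in the note above (the helper function and the numbers are hypothetical, not part of Octavia):

```python
# Hypothetical sizing helper mirroring the guidance above: pick a
# threshold larger than the expected per-host amphora count, or
# 20-30% of the total fleet, whichever is greater.
def suggested_failover_threshold(total_amphorae: int,
                                 amphorae_per_host: int,
                                 fleet_fraction: float = 0.25) -> int:
    return max(amphorae_per_host + 1,
               round(total_amphorae * fleet_fraction))

# With a fleet of 100 amphorae and roughly 10 per host,
# this suggests a threshold of 25.
print(suggested_failover_threshold(100, 10))
```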
Error Recovery
==============

Automatic Error Recovery
------------------------

For amphorae whose status is *FAILOVER_STOPPED*, Octavia
automatically resets their status to *ALLOCATED* after receiving
new updates (heartbeats) from these amphorae.
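The recovery trigger is the same staleness test the health manager applies elsewhere: an amphora whose last heartbeat is newer than the *heartbeat_timeout* window is no longer stale. A simplified sketch of that check (illustrative only, not Octavia's actual code):

```python
import datetime

# An amphora is stale when its last heartbeat is older than the
# heartbeat_timeout window; once a fresh heartbeat arrives it is no
# longer stale, so a FAILOVER_STOPPED status can revert to ALLOCATED.
def is_stale(last_update: datetime.datetime,
             now: datetime.datetime,
             heartbeat_timeout: int = 60) -> bool:
    expired_time = now - datetime.timedelta(seconds=heartbeat_timeout)
    return last_update < expired_time

now = datetime.datetime.utcnow()
print(is_stale(now - datetime.timedelta(minutes=5), now))  # stale
print(is_stale(now, now))  # fresh heartbeat, eligible for recovery
```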
Manual Error Recovery
---------------------

To recover from the *FAILOVER_STOPPED* condition, you must
manually reduce the number of stale amphorae below the
circuit breaker threshold.

You can use the ``openstack loadbalancer amphora list`` command
to list the amphorae that are in the *FAILOVER_STOPPED* state.
Use the ``openstack loadbalancer amphora failover`` command to
manually trigger a failover of an amphora.
In this example, *failover_threshold = 3* and an infrastructure
outage caused four amphorae to become unavailable. After the
health manager process detects this state, it sets the status
of all stale amphorae to *FAILOVER_STOPPED* as shown below.

.. code-block:: bash

   openstack loadbalancer amphora list
   +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
   | id                                   | loadbalancer_id                      | status           | role   | lb_network_ip | ha_ip      |
   +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
   | 79f0e06d-446d-448a-9d2b-c3b89d0c700d | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | BACKUP | 192.168.0.108 | 192.0.2.17 |
   | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED        | MASTER | 192.168.0.198 | 192.0.2.42 |
   | e11208b7-f13d-4db3-9ded-1ee6f70a0502 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | MASTER | 192.168.0.154 | 192.0.2.17 |
   | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED        | MASTER | 192.168.0.149 | 192.0.2.26 |
   | a1351933-2270-493c-8201-d8f9f9fe42f7 | 4b13dda1-296a-400c-8248-1abad5728057 | FAILOVER_STOPPED | BACKUP | 192.168.0.103 | 192.0.2.42 |
   | 441718e7-0956-436b-9f99-9a476339d7d2 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | FAILOVER_STOPPED | BACKUP | 192.168.0.148 | 192.0.2.26 |
   +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
After operators have resolved the infrastructure outage,
they might need to manually trigger failovers to return to
normal operation. In this example, two manual failovers are
necessary to bring the number of stale amphorae below the
configured threshold of three:

.. code-block:: bash

   openstack loadbalancer amphora failover --wait 79f0e06d-446d-448a-9d2b-c3b89d0c700d
   openstack loadbalancer amphora list
   +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
   | id                                   | loadbalancer_id                      | status           | role   | lb_network_ip | ha_ip      |
   +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
   | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED        | MASTER | 192.168.0.198 | 192.0.2.42 |
   | e11208b7-f13d-4db3-9ded-1ee6f70a0502 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | MASTER | 192.168.0.154 | 192.0.2.17 |
   | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED        | MASTER | 192.168.0.149 | 192.0.2.26 |
   | a1351933-2270-493c-8201-d8f9f9fe42f7 | 4b13dda1-296a-400c-8248-1abad5728057 | FAILOVER_STOPPED | BACKUP | 192.168.0.103 | 192.0.2.42 |
   | 441718e7-0956-436b-9f99-9a476339d7d2 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | FAILOVER_STOPPED | BACKUP | 192.168.0.148 | 192.0.2.26 |
   | cf734b57-6019-4ec0-8437-115f76d1bbb0 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED        | BACKUP | 192.168.0.141 | 192.0.2.17 |
   +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
   openstack loadbalancer amphora failover --wait e11208b7-f13d-4db3-9ded-1ee6f70a0502
   openstack loadbalancer amphora list
   +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
   | id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
   +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
   | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED | MASTER | 192.168.0.198 | 192.0.2.42 |
   | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED | MASTER | 192.168.0.149 | 192.0.2.26 |
   | cf734b57-6019-4ec0-8437-115f76d1bbb0 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED | BACKUP | 192.168.0.141 | 192.0.2.17 |
   | d2909051-402e-4e75-86c9-ec6725c814a1 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED | MASTER | 192.168.0.25  | 192.0.2.17 |
   | 5133e01a-fb53-457b-b810-edbb5202437e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED | BACKUP | 192.168.0.76  | 192.0.2.42 |
   | f82eff89-e326-4e9d-86bc-58c720220a3f | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED | BACKUP | 192.168.0.86  | 192.0.2.26 |
   +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
After the number of stale amphorae falls below the configured
threshold value, normal operation resumes and the automatic
failover process attempts to restore the remaining stale amphorae.
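Taken together, the circuit-breaker decision described in this document reduces to a single threshold comparison. The function below is an illustrative sketch, not Octavia's implementation:

```python
# Illustrative circuit-breaker check: automatic failover proceeds only
# while the stale-amphora count is below the configured threshold.
# A threshold of None means the circuit breaker is disabled.
def failover_allowed(stale_count: int, failover_threshold=None) -> bool:
    if failover_threshold is None:
        return True
    return stale_count < failover_threshold

print(failover_allowed(4, failover_threshold=3))  # False: breaker trips
print(failover_allowed(2, failover_threshold=3))  # True: below threshold
print(failover_allowed(100))                      # True: feature disabled
```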
@@ -36,6 +36,7 @@ Optional Installation and Configuration Guides

   healthcheck.rst
   flavors.rst
   apache-httpd.rst
   failover-circuit-breaker.rst

Maintenance and Operations
--------------------------
@@ -128,6 +128,11 @@

# heartbeat_timeout = 60
# health_check_interval = 3
# sock_rlimit = 0

# Stop failovers if the count of simultaneously failed
# amphora reaches this number (circuit breaker). This may prevent large
# scale accidental failover events, like in the case of
# network failures or read-only database issues.
# failover_threshold =

[keystone_authtoken]
# This group of config options are imported from keystone middleware. Thus the
@@ -304,6 +304,11 @@ health_manager_opts = [
               help=_('Sleep time between health checks in seconds.')),
    cfg.IntOpt('sock_rlimit', default=0,
               help=_(' sets the value of the heartbeat recv buffer')),
    cfg.IntOpt('failover_threshold', default=None,
               help=_('Stop failovers if the count of simultaneously failed '
                      'amphora reaches this number. This may prevent large '
                      'scale accidental failover events, like in the case of '
                      'network failures or read-only database issues.')),

    # Used by the health manager on the amphora
    cfg.ListOpt('controller_ip_port_list',
@@ -161,6 +161,8 @@ AMPHORA_ALLOCATED = lib_consts.AMPHORA_ALLOCATED
AMPHORA_BOOTING = lib_consts.AMPHORA_BOOTING
# Amphora is ready to be allocated to a load balancer 'READY'
AMPHORA_READY = lib_consts.AMPHORA_READY
# 'FAILOVER_STOPPED'. Failover threshold level has been reached.
AMPHORA_FAILOVER_STOPPED = lib_consts.AMPHORA_FAILOVER_STOPPED
# 'ACTIVE'
ACTIVE = lib_consts.ACTIVE
# 'PENDING_DELETE'

@@ -248,13 +250,6 @@ MUTABLE_STATUSES = (lib_consts.ACTIVE,)
DELETABLE_STATUSES = (lib_consts.ACTIVE, lib_consts.ERROR)
FAILOVERABLE_STATUSES = (lib_consts.ACTIVE, lib_consts.ERROR)

# Note: The database Amphora table has a foreign key constraint against
# the provisioning_status table
SUPPORTED_AMPHORA_STATUSES = (
    lib_consts.AMPHORA_ALLOCATED, lib_consts.AMPHORA_BOOTING, lib_consts.ERROR,
    lib_consts.AMPHORA_READY, lib_consts.DELETED, lib_consts.PENDING_CREATE,
    lib_consts.PENDING_DELETE)

AMPHORA_VM = 'VM'
SUPPORTED_AMPHORA_TYPES = (AMPHORA_VM,)
@@ -92,7 +92,6 @@ class HealthManager(object):
        lock_session = None
        try:
            lock_session = db_api.get_session(autocommit=False)
            amp = None
            amp_health = self.amp_health_repo.get_stale_amphora(
                lock_session)
            if amp_health:
@@ -0,0 +1,41 @@
# Copyright Red Hat
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Add FAILOVER_STOPPED to provisioning_status table

Revision ID: 0995c26fc506
Revises: 31f7653ded67
Create Date: 2022-03-24 04:53:10.768658

"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = '0995c26fc506'
down_revision = '31f7653ded67'


def upgrade():
    insert_table = sa.sql.table(
        'provisioning_status',
        sa.sql.column('name', sa.String),
        sa.sql.column('description', sa.String)
    )

    op.bulk_insert(
        insert_table,
        [
            {'name': 'FAILOVER_STOPPED'},
        ]
    )
@@ -19,6 +19,7 @@ reference
"""

import datetime
from typing import Optional

from oslo_config import cfg
from oslo_db import api as oslo_db_api

@@ -28,9 +29,12 @@ from oslo_serialization import jsonutils
from oslo_utils import excutils
from oslo_utils import uuidutils
from sqlalchemy.orm import noload
from sqlalchemy.orm import Session
from sqlalchemy.orm import subqueryload
from sqlalchemy import select
from sqlalchemy.sql.expression import false
from sqlalchemy.sql import func
from sqlalchemy import update

from octavia.common import constants as consts
from octavia.common import data_models
@@ -1678,28 +1682,99 @@ class AmphoraHealthRepository(BaseRepository):
            # In this case, the amphora is expired.
            return amphora_model is None

    def get_stale_amphora(self, session):
    def get_stale_amphora(self,
                          lock_session: Session) -> Optional[models.Amphora]:
        """Retrieves a stale amphora from the health manager database.

        :param session: A Sql Alchemy database session.
        :param lock_session: A Sql Alchemy database autocommit session.
        :returns: [octavia.common.data_model]
        """
        timeout = CONF.health_manager.heartbeat_timeout
        expired_time = datetime.datetime.utcnow() - datetime.timedelta(
            seconds=timeout)

        amp = session.query(self.model_class).with_for_update().filter_by(
            busy=False).filter(
            self.model_class.last_update < expired_time).order_by(
            func.random()).first()
        # Update any amphora that were previously FAILOVER_STOPPED
        # but are no longer expired.
        self.update_failover_stopped(lock_session, expired_time)

        if amp is None:
        # Handle expired amphora
        expired_ids_query = select(self.model_class.amphora_id).where(
            self.model_class.busy == false()).where(
            self.model_class.last_update < expired_time)

        expired_count = lock_session.scalar(
            select(func.count()).select_from(expired_ids_query))

        threshold = CONF.health_manager.failover_threshold
        if threshold is not None and expired_count >= threshold:
            LOG.error('Stale amphora count reached the threshold '
                      '(%(th)s). %(count)s amphorae were set into '
                      'FAILOVER_STOPPED status.',
                      {'th': threshold, 'count': expired_count})
            lock_session.execute(
                update(
                    models.Amphora
                ).where(
                    models.Amphora.status.notin_(
                        [consts.DELETED, consts.PENDING_DELETE])
                ).where(
                    models.Amphora.id.in_(expired_ids_query)
                ).values(
                    status=consts.AMPHORA_FAILOVER_STOPPED
                ).execution_options(synchronize_session="fetch"))
            return None

        amp.busy = True
        # We don't want to attempt to failover amphora that are not
        # currently in the ALLOCATED or FAILOVER_STOPPED state.
        # i.e. Not DELETED, PENDING_*, etc.
        allocated_amp_ids_subquery = (
            select(models.Amphora.id).where(
                models.Amphora.status.in_(
                    [consts.AMPHORA_ALLOCATED,
                     consts.AMPHORA_FAILOVER_STOPPED])))

        return amp.to_data_model()
        # Pick one expired amphora for automatic failover
        amp_health = lock_session.query(
            self.model_class
        ).with_for_update(
        ).filter(
            self.model_class.amphora_id.in_(expired_ids_query)
        ).filter(
            self.model_class.amphora_id.in_(allocated_amp_ids_subquery)
        ).order_by(
            func.random()
        ).limit(1).first()

        if amp_health is None:
            return None

        amp_health.busy = True

        return amp_health.to_data_model()

    def update_failover_stopped(self, lock_session: Session,
                                expired_time: datetime) -> None:
        """Updates the status of amps that are FAILOVER_STOPPED."""
        # Update any FAILOVER_STOPPED amphora that are no longer stale
        # back to ALLOCATED.
        # Note: This uses sqlalchemy 2.0 syntax
        not_expired_ids_subquery = (
            select(self.model_class.amphora_id).where(
                self.model_class.busy == false()
            ).where(
                self.model_class.last_update >= expired_time
            ))

        # Note: mysql and sqlite do not support RETURNING, so we cannot
        # get back the affected amphora IDs. (09/2022)
        lock_session.execute(
            update(models.Amphora).where(
                models.Amphora.status == consts.AMPHORA_FAILOVER_STOPPED
            ).where(
                models.Amphora.id.in_(not_expired_ids_subquery)
            ).values(
                status=consts.AMPHORA_ALLOCATED
            ).execution_options(synchronize_session="fetch"))


class VRRPGroupRepository(BaseRepository):
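The bulk status flip in ``get_stale_amphora`` above is a single ``UPDATE ... WHERE id IN (subquery)`` statement. The stdlib ``sqlite3`` sketch below illustrates the same pattern outside SQLAlchemy; the table layout and rows are hypothetical, reduced versions of the real schema:

```python
import sqlite3

# Hypothetical miniature of the UPDATE ... WHERE id IN (subquery)
# pattern: flip every stale, non-busy amphora that is not deleting
# to FAILOVER_STOPPED in one statement.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amphora (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE amphora_health (
        amphora_id TEXT PRIMARY KEY, busy INTEGER, last_update REAL);
    INSERT INTO amphora VALUES ('a1', 'ALLOCATED'), ('a2', 'ALLOCATED'),
                               ('a3', 'PENDING_DELETE');
    INSERT INTO amphora_health VALUES ('a1', 0, 100.0), ('a2', 1, 100.0),
                                      ('a3', 0, 100.0);
""")
expired_time = 200.0  # anything with last_update < 200 counts as stale
conn.execute(
    "UPDATE amphora SET status = 'FAILOVER_STOPPED' "
    "WHERE status NOT IN ('DELETED', 'PENDING_DELETE') "
    "AND id IN (SELECT amphora_id FROM amphora_health "
    "           WHERE busy = 0 AND last_update < ?)",
    (expired_time,))
rows = sorted(conn.execute("SELECT id, status FROM amphora"))
# Only a1 flips: a2 is busy, a3 is pending delete.
print(rows)
```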
@@ -4144,12 +4144,31 @@ class AmphoraRepositoryTest(BaseRepositoryTest):

class AmphoraHealthRepositoryTest(BaseRepositoryTest):
    def setUp(self):
        super().setUp()
        self._fake_ip_gen = (self.FAKE_IP + str(ip_end) for ip_end in
                             range(100))
        self.amphora = self.amphora_repo.create(self.session,
                                                id=self.FAKE_UUID_1,
                                                compute_id=self.FAKE_UUID_3,
                                                status=constants.ACTIVE,
                                                lb_network_ip=self.FAKE_IP)

    def create_amphora(self, amphora_id, **overrides):
        fake_ip = next(self._fake_ip_gen)
        settings = {
            'id': amphora_id,
            'compute_id': uuidutils.generate_uuid(),
            'status': constants.ACTIVE,
            'lb_network_ip': fake_ip,
            'vrrp_ip': fake_ip,
            'ha_ip': fake_ip,
            'role': constants.ROLE_MASTER,
            'cert_expiration': datetime.datetime.utcnow(),
            'cert_busy': False
        }
        settings.update(overrides)
        amphora = self.amphora_repo.create(self.session, **settings)
        return amphora

    def create_amphora_health(self, amphora_id):
        newdate = datetime.datetime.utcnow() - datetime.timedelta(minutes=10)
@@ -4216,10 +4235,94 @@
            self.session)
        self.assertIsNone(stale_amphora)

        self.create_amphora_health(self.amphora.id)
        uuid = uuidutils.generate_uuid()
        self.create_amphora(uuid)
        self.amphora_repo.update(self.session, uuid,
                                 status=constants.AMPHORA_ALLOCATED)
        self.create_amphora_health(uuid)
        stale_amphora = self.amphora_health_repo.get_stale_amphora(
            self.session)
        self.assertEqual(self.amphora.id, stale_amphora.amphora_id)
        self.assertEqual(uuid, stale_amphora.amphora_id)

    def test_get_stale_amphora_past_threshold(self):
        conf = self.useFixture(oslo_fixture.Config(cfg.CONF))
        conf.config(group='health_manager', failover_threshold=3)

        stale_amphora = self.amphora_health_repo.get_stale_amphora(
            self.session)
        self.assertIsNone(stale_amphora)

        # Two stale amphora expected, should return that amp
        # These will go into failover and be marked "busy"
        uuids = []
        for _ in range(2):
            uuid = uuidutils.generate_uuid()
            uuids.append(uuid)
            self.create_amphora(uuid)
            self.amphora_repo.update(self.session, uuid,
                                     status=constants.AMPHORA_ALLOCATED)
            self.create_amphora_health(uuid)
            stale_amphora = self.amphora_health_repo.get_stale_amphora(
                self.session)
            self.assertIn(stale_amphora.amphora_id, uuids)

        # Creating more stale amphorae should return no amps (past threshold)
        stale_uuids = []
        for _ in range(4):
            uuid = uuidutils.generate_uuid()
            stale_uuids.append(uuid)
            self.create_amphora(uuid)
            self.amphora_repo.update(self.session, uuid,
                                     status=constants.AMPHORA_ALLOCATED)
            self.create_amphora_health(uuid)
        stale_amphora = self.amphora_health_repo.get_stale_amphora(
            self.session)
        self.assertIsNone(stale_amphora)
        num_fo_stopped = self.session.query(db_models.Amphora).filter(
            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
        ).count()
        # Note that the two amphora started failover, so are "busy" and
        # should not be marked FAILOVER_STOPPED.
        self.assertEqual(4, num_fo_stopped)

        # One recovered, but still over threshold
        # Two "busy", One fully healthy, three in FAILOVER_STOPPED
        amp = self.session.query(db_models.AmphoraHealth).filter_by(
            amphora_id=stale_uuids[2]).first()
        amp.last_update = datetime.datetime.utcnow()
        self.session.flush()
        stale_amphora = self.amphora_health_repo.get_stale_amphora(
            self.session)
        self.assertIsNone(stale_amphora)
        num_fo_stopped = self.session.query(db_models.Amphora).filter(
            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
        ).count()
        self.assertEqual(3, num_fo_stopped)

        # Another one recovered, now below threshold
        # Two are "busy", Two are fully healthy, Two are in FAILOVER_STOPPED
        amp = self.session.query(db_models.AmphoraHealth).filter_by(
            amphora_id=stale_uuids[3]).first()
        amp.last_update = datetime.datetime.utcnow()
        stale_amphora = self.amphora_health_repo.get_stale_amphora(
            self.session)
        self.assertIsNotNone(stale_amphora)
        num_fo_stopped = self.session.query(db_models.Amphora).filter(
            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
        ).count()
        self.assertEqual(2, num_fo_stopped)

        # After error recovery all amps should be allocated again
        now = datetime.datetime.utcnow()
        for amp in self.session.query(db_models.AmphoraHealth).all():
            amp.last_update = now
        stale_amphora = self.amphora_health_repo.get_stale_amphora(
            self.session)
        self.assertIsNone(stale_amphora)
        num_allocated = self.session.query(db_models.Amphora).filter(
            db_models.Amphora.status == constants.AMPHORA_ALLOCATED
        ).count()
        self.assertEqual(5, num_allocated)

    def test_create(self):
        amphora_health = self.create_amphora_health(self.FAKE_UUID_1)
releasenotes/notes/failover-threshold-f5cdf2bbe8a64d6d.yaml (new file, 12 lines)
@@ -0,0 +1,12 @@
---
features:
  - |
    A new configuration option ``failover_threshold`` can be set to limit the
    number of amphorae simultaneously pending failover before halting the
    automatic failover process. This should help prevent unwanted mass failover
    events that can happen in cases like network interruption to an AZ or the
    database becoming read-only. This feature is not enabled by default, and it
    should be configured carefully based on the size of the environment.
    For example, with 100 amphorae a good threshold might be 20 or 30, or
    a value greater than the typical number of amphorae that would be expected
    on a single host.
@@ -46,7 +46,7 @@ castellan>=0.16.0 # Apache-2.0
tenacity>=5.0.4 # Apache-2.0
distro>=1.2.0 # Apache-2.0
jsonschema>=3.2.0 # MIT
octavia-lib>=2.5.0 # Apache-2.0
octavia-lib>=3.1.0 # Apache-2.0
simplejson>=3.13.2 # MIT
setproctitle>=1.1.10 # BSD
python-dateutil>=2.7.0 # BSD