Failover stop threshold / circuit breaker
Stop failovers if the count of simultaneously failed amphora reaches the
number configured in the new failover_threshold option. This may prevent
large scale accidental failover events, like in the case of network failures
or read-only database issues.

Story: 2005604
Task: 30837
Co-Authored-By: Tatsuma Matsuki <matsuki.tatsuma@jp.fujitsu.com>
Co-Authored-By: Tom Weininger <tweining@redhat.com>
Change-Id: I0d2c332fa72e47e70d594579ab819a6ece094cdd
parent 8c793f2e8f
commit 1d19b702b1
131
doc/source/admin/failover-circuit-breaker.rst
Normal file
@@ -0,0 +1,131 @@
..
      Copyright Red Hat

      Licensed under the Apache License, Version 2.0 (the "License"); you may
      not use this file except in compliance with the License. You may obtain
      a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
      License for the specific language governing permissions and limitations
      under the License.

========================================
Octavia Amphora Failover Circuit Breaker
========================================

During a large infrastructure outage, the automatic failover of stale
amphorae can lead to a mass failover event and create a considerable
amount of extra load on servers. By using the amphora failover
circuit breaker feature, you can avoid these unwanted failover events.
The circuit breaker is a configurable threshold: whenever the number of
stale amphorae reaches that value, Octavia stops failing amphorae over
automatically. The circuit breaker feature is disabled by default.

Configuration
=============

You define the threshold value for the failover circuit breaker feature
by setting the *failover_threshold* option. The *failover_threshold*
option is a member of the *health_manager* group within the
configuration file ``/etc/octavia/octavia.conf``.
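
As a minimal sketch of the setting described above, the following Python snippet parses an illustrative configuration fragment with the standard-library ``configparser`` module and reads the option. The value ``10`` and the helper name ``read_failover_threshold`` are assumptions made for this example only; Octavia itself loads the option through oslo.config.

```python
import configparser

# Illustrative fragment of /etc/octavia/octavia.conf. The value 10 is an
# arbitrary example for this sketch, not a recommendation.
SAMPLE_CONF = """
[health_manager]
heartbeat_timeout = 60
failover_threshold = 10
"""


def read_failover_threshold(conf_text):
    """Return failover_threshold as an int, or None when it is unset.

    Hypothetical helper for illustration; not part of Octavia.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("health_manager", "failover_threshold", fallback=None)
    if raw is None or not raw.strip():
        # An unset option means the circuit breaker is disabled.
        return None
    return int(raw)


if __name__ == "__main__":
    print(read_failover_threshold(SAMPLE_CONF))
```

When the option is absent or empty, the helper returns ``None``, which matches the documented default of a disabled circuit breaker.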

Whenever the number of stale amphorae reaches or surpasses the value
of *failover_threshold*, Octavia performs the following actions:

* stops automatic failovers of amphorae.
* sets the status of the stale amphorae to *FAILOVER_STOPPED*.
* logs an error message.

The line below shows a typical error message:

.. code-block:: bash

    ERROR octavia.db.repositories [-] Stale amphora count reached the threshold (3). 4 amphorae were set into FAILOVER_STOPPED status.
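
The three actions listed above can be sketched in plain Python as follows. This is a simplified, in-memory illustration under stated assumptions: the ``StaleAmphora`` class and the ``apply_circuit_breaker`` function are hypothetical names for this example, and the real health manager performs the equivalent check against the database.

```python
from dataclasses import dataclass
from typing import List, Optional

FAILOVER_STOPPED = "FAILOVER_STOPPED"


@dataclass
class StaleAmphora:
    """Hypothetical stand-in for a stale amphora record."""
    amphora_id: str
    status: str = "ALLOCATED"


def apply_circuit_breaker(stale: List[StaleAmphora],
                          threshold: Optional[int]) -> bool:
    """Return True when the breaker trips and automatic failovers stop."""
    if threshold is None or len(stale) < threshold:
        # Breaker disabled, or the stale count is still below the threshold.
        return False
    for amp in stale:
        amp.status = FAILOVER_STOPPED
    print("Stale amphora count reached the threshold (%d). %d amphorae "
          "were set into FAILOVER_STOPPED status." % (threshold, len(stale)))
    return True


if __name__ == "__main__":
    amps = [StaleAmphora("amp-%d" % i) for i in range(4)]
    apply_circuit_breaker(amps, threshold=3)
```

With four stale amphorae and a threshold of three, the breaker trips and all four records are marked *FAILOVER_STOPPED*, which corresponds to the sample log line.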

.. note:: Base the value that you set for *failover_threshold* on the
   size of your environment. We recommend that you set the value to a number
   greater than the typical number of amphorae that you estimate to run on a
   single host, or to a value that reflects between 20% and 30%
   of the total number of amphorae.
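
The sizing guidance in the note can be turned into simple arithmetic. The sketch below is one possible interpretation, not an official formula: the function name ``suggest_threshold_range`` is hypothetical, and combining the two recommendations by taking the larger value is an assumption of this example.

```python
def suggest_threshold_range(total_amphorae, max_per_host):
    """Return (low, high) candidate values for failover_threshold.

    low/high are 20% and 30% of the total, rounded up with integer math.
    Both are raised, if needed, to exceed the amphorae on a single host.
    """
    low = (total_amphorae * 20 + 99) // 100   # ceil(20% of total)
    high = (total_amphorae * 30 + 99) // 100  # ceil(30% of total)
    per_host_floor = max_per_host + 1         # strictly more than one host
    return max(low, per_host_floor), max(high, per_host_floor)


if __name__ == "__main__":
    # 100 amphorae, at most 8 on any single host.
    print(suggest_threshold_range(100, 8))
```

For an environment with 100 amphorae and at most eight per host, the sketch yields a range of 20 to 30. In a small environment the per-host figure dominates, so that the loss of one host alone does not trip the breaker.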

Error Recovery
==============

Automatic Error Recovery
------------------------

For amphorae whose status is *FAILOVER_STOPPED*, Octavia will
automatically reset their status to *ALLOCATED* after receiving
new updates from these amphorae.
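
The automatic reset can be illustrated with a small in-memory model. This is a sketch under stated assumptions: ``AmphoraRecord`` and ``reset_recovered`` are hypothetical names, the 60-second default mirrors the ``heartbeat_timeout`` sample value in the configuration file, and the real implementation performs the equivalent update in the database.

```python
import datetime
from dataclasses import dataclass


@dataclass
class AmphoraRecord:
    """Hypothetical stand-in for an amphora and its health record."""
    amphora_id: str
    status: str
    last_update: datetime.datetime
    busy: bool = False


def reset_recovered(amphorae, now, heartbeat_timeout=60):
    """Set FAILOVER_STOPPED amphorae with fresh heartbeats to ALLOCATED.

    Returns the IDs of the amphorae that were reset.
    """
    expired_time = now - datetime.timedelta(seconds=heartbeat_timeout)
    recovered = []
    for amp in amphorae:
        is_fresh = amp.last_update >= expired_time
        if amp.status == "FAILOVER_STOPPED" and not amp.busy and is_fresh:
            amp.status = "ALLOCATED"
            recovered.append(amp.amphora_id)
    return recovered


if __name__ == "__main__":
    now = datetime.datetime(2022, 9, 1, 12, 0, 0)
    amps = [
        AmphoraRecord("a", "FAILOVER_STOPPED",
                      now - datetime.timedelta(seconds=5)),
        AmphoraRecord("b", "FAILOVER_STOPPED",
                      now - datetime.timedelta(seconds=600)),
    ]
    print(reset_recovered(amps, now))
```

Only the amphora whose last update falls inside the heartbeat window is reset; the one that is still silent stays in *FAILOVER_STOPPED*.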

Manual Error Recovery
---------------------

To recover from the *FAILOVER_STOPPED* condition, you must
manually reduce the number of stale amphorae below the
circuit breaker threshold.

You can use the ``openstack loadbalancer amphora list`` command
to list the amphorae that are in the *FAILOVER_STOPPED* state.
Use the ``openstack loadbalancer amphora failover`` command to
manually fail over an amphora.

In this example, *failover_threshold = 3* and an infrastructure
outage caused four amphorae to become unavailable. After the
health manager process detects this state, it sets the status
of all stale amphorae to *FAILOVER_STOPPED* as shown below.

.. code-block:: bash

    openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | id                                   | loadbalancer_id                      | status           | role   | lb_network_ip | ha_ip      |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | 79f0e06d-446d-448a-9d2b-c3b89d0c700d | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | BACKUP | 192.168.0.108 | 192.0.2.17 |
    | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED        | MASTER | 192.168.0.198 | 192.0.2.42 |
    | e11208b7-f13d-4db3-9ded-1ee6f70a0502 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | MASTER | 192.168.0.154 | 192.0.2.17 |
    | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED        | MASTER | 192.168.0.149 | 192.0.2.26 |
    | a1351933-2270-493c-8201-d8f9f9fe42f7 | 4b13dda1-296a-400c-8248-1abad5728057 | FAILOVER_STOPPED | BACKUP | 192.168.0.103 | 192.0.2.42 |
    | 441718e7-0956-436b-9f99-9a476339d7d2 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | FAILOVER_STOPPED | BACKUP | 192.168.0.148 | 192.0.2.26 |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+

After operators have resolved the infrastructure outage,
they might need to manually trigger failovers to return to
normal operation. In this example, two manual failovers are
necessary to get the number of stale amphorae below the
configured threshold of three:

.. code-block:: bash

    openstack loadbalancer amphora failover --wait 79f0e06d-446d-448a-9d2b-c3b89d0c700d
    openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | id                                   | loadbalancer_id                      | status           | role   | lb_network_ip | ha_ip      |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED        | MASTER | 192.168.0.198 | 192.0.2.42 |
    | e11208b7-f13d-4db3-9ded-1ee6f70a0502 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | MASTER | 192.168.0.154 | 192.0.2.17 |
    | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED        | MASTER | 192.168.0.149 | 192.0.2.26 |
    | a1351933-2270-493c-8201-d8f9f9fe42f7 | 4b13dda1-296a-400c-8248-1abad5728057 | FAILOVER_STOPPED | BACKUP | 192.168.0.103 | 192.0.2.42 |
    | 441718e7-0956-436b-9f99-9a476339d7d2 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | FAILOVER_STOPPED | BACKUP | 192.168.0.148 | 192.0.2.26 |
    | cf734b57-6019-4ec0-8437-115f76d1bbb0 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED        | BACKUP | 192.168.0.141 | 192.0.2.17 |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    openstack loadbalancer amphora failover --wait e11208b7-f13d-4db3-9ded-1ee6f70a0502
    openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
    | id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
    | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED | MASTER | 192.168.0.198 | 192.0.2.42 |
    | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED | MASTER | 192.168.0.149 | 192.0.2.26 |
    | cf734b57-6019-4ec0-8437-115f76d1bbb0 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED | BACKUP | 192.168.0.141 | 192.0.2.17 |
    | d2909051-402e-4e75-86c9-ec6725c814a1 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED | MASTER | 192.168.0.25  | 192.0.2.17 |
    | 5133e01a-fb53-457b-b810-edbb5202437e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED | BACKUP | 192.168.0.76  | 192.0.2.42 |
    | f82eff89-e326-4e9d-86bc-58c720220a3f | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED | BACKUP | 192.168.0.86  | 192.0.2.26 |
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+

After the number of stale amphorae falls below the configured
threshold value, normal operation resumes and the automatic
failover process attempts to restore the remaining stale amphorae.
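
The arithmetic behind the manual recovery example can be sketched in Python. Several points here are assumptions rather than documented behavior: the helper name ``failovers_needed`` is hypothetical, the input is assumed to be the JSON form of the amphora list (for example as produced with the client's ``-f json`` output option), the JSON keys are assumed to match the column names shown in the tables, and the count of *FAILOVER_STOPPED* rows is used as an approximation of the stale amphora count.

```python
import json


def failovers_needed(amphora_list_json, threshold):
    """Return (stopped_ids, count) for a JSON amphora listing.

    count is how many manual failovers bring the number of
    FAILOVER_STOPPED amphorae strictly below the threshold.
    """
    rows = json.loads(amphora_list_json)
    stopped = [row["id"] for row in rows
               if row["status"] == "FAILOVER_STOPPED"]
    needed = max(0, len(stopped) - threshold + 1)
    return stopped, needed


if __name__ == "__main__":
    # Shortened sample IDs; statuses follow the first table above.
    sample = json.dumps([
        {"id": "79f0e06d", "status": "FAILOVER_STOPPED"},
        {"id": "9c0416d7", "status": "ALLOCATED"},
        {"id": "e11208b7", "status": "FAILOVER_STOPPED"},
        {"id": "ceea9fff", "status": "ALLOCATED"},
        {"id": "a1351933", "status": "FAILOVER_STOPPED"},
        {"id": "441718e7", "status": "FAILOVER_STOPPED"},
    ])
    stopped, needed = failovers_needed(sample, threshold=3)
    print(len(stopped), needed)
```

With four stopped amphorae and a threshold of three, the sketch reports that two manual failovers are needed, which agrees with the example.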
@@ -36,6 +36,7 @@ Optional Installation and Configuration Guides
     healthcheck.rst
     flavors.rst
     apache-httpd.rst
+    failover-circuit-breaker.rst

 Maintanence and Operations
 --------------------------
@@ -128,6 +128,11 @@
 # heartbeat_timeout = 60
 # health_check_interval = 3
 # sock_rlimit = 0
+# Stop failovers if the count of simultaneously failed
+# amphora reaches this number (circuit breaker). This may prevent large
+# scale accidental failover events, like in the case of
+# network failures or read-only database issues.
+# failover_threshold =

 [keystone_authtoken]
 # This group of config options are imported from keystone middleware. Thus the
@@ -304,6 +304,11 @@ health_manager_opts = [
                help=_('Sleep time between health checks in seconds.')),
     cfg.IntOpt('sock_rlimit', default=0,
                help=_(' sets the value of the heartbeat recv buffer')),
+    cfg.IntOpt('failover_threshold', default=None,
+               help=_('Stop failovers if the count of simultaneously failed '
+                      'amphora reaches this number. This may prevent large '
+                      'scale accidental failover events, like in the case of '
+                      'network failures or read-only database issues.')),

     # Used by the health manager on the amphora
     cfg.ListOpt('controller_ip_port_list',
@@ -161,6 +161,8 @@ AMPHORA_ALLOCATED = lib_consts.AMPHORA_ALLOCATED
 AMPHORA_BOOTING = lib_consts.AMPHORA_BOOTING
 # Amphora is ready to be allocated to a load balancer 'READY'
 AMPHORA_READY = lib_consts.AMPHORA_READY
+# 'FAILOVER_STOPPED'. Failover threshold level has been reached.
+AMPHORA_FAILOVER_STOPPED = lib_consts.AMPHORA_FAILOVER_STOPPED
 # 'ACTIVE'
 ACTIVE = lib_consts.ACTIVE
 # 'PENDING_DELETE'
@@ -248,13 +250,6 @@ MUTABLE_STATUSES = (lib_consts.ACTIVE,)
 DELETABLE_STATUSES = (lib_consts.ACTIVE, lib_consts.ERROR)
 FAILOVERABLE_STATUSES = (lib_consts.ACTIVE, lib_consts.ERROR)

-# Note: The database Amphora table has a foreign key constraint against
-# the provisioning_status table
-SUPPORTED_AMPHORA_STATUSES = (
-    lib_consts.AMPHORA_ALLOCATED, lib_consts.AMPHORA_BOOTING, lib_consts.ERROR,
-    lib_consts.AMPHORA_READY, lib_consts.DELETED, lib_consts.PENDING_CREATE,
-    lib_consts.PENDING_DELETE)
-
 AMPHORA_VM = 'VM'
 SUPPORTED_AMPHORA_TYPES = (AMPHORA_VM,)

@@ -92,7 +92,6 @@ class HealthManager(object):
             lock_session = None
             try:
                 lock_session = db_api.get_session(autocommit=False)
-                amp = None
                 amp_health = self.amp_health_repo.get_stale_amphora(
                     lock_session)
                 if amp_health:
@@ -0,0 +1,41 @@
# Copyright Red Hat
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Add FAILOVER_STOPPED to provisioning_status table

Revision ID: 0995c26fc506
Revises: 31f7653ded67
Create Date: 2022-03-24 04:53:10.768658

"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = '0995c26fc506'
down_revision = '31f7653ded67'


def upgrade():
    insert_table = sa.sql.table(
        'provisioning_status',
        sa.sql.column('name', sa.String),
        sa.sql.column('description', sa.String)
    )

    op.bulk_insert(
        insert_table,
        [
            {'name': 'FAILOVER_STOPPED'},
        ]
    )
@@ -19,6 +19,7 @@ reference
 """

 import datetime
+from typing import Optional

 from oslo_config import cfg
 from oslo_db import api as oslo_db_api
@@ -28,9 +29,12 @@ from oslo_serialization import jsonutils
 from oslo_utils import excutils
 from oslo_utils import uuidutils
 from sqlalchemy.orm import noload
+from sqlalchemy.orm import Session
 from sqlalchemy.orm import subqueryload
+from sqlalchemy import select
 from sqlalchemy.sql.expression import false
 from sqlalchemy.sql import func
+from sqlalchemy import update

 from octavia.common import constants as consts
 from octavia.common import data_models
@@ -1678,28 +1682,99 @@ class AmphoraHealthRepository(BaseRepository):
         # In this case, the amphora is expired.
         return amphora_model is None

-    def get_stale_amphora(self, session):
+    def get_stale_amphora(self,
+                          lock_session: Session) -> Optional[models.Amphora]:
         """Retrieves a stale amphora from the health manager database.

-        :param session: A Sql Alchemy database session.
+        :param lock_session: A Sql Alchemy database autocommit session.
         :returns: [octavia.common.data_model]
         """
-
         timeout = CONF.health_manager.heartbeat_timeout
         expired_time = datetime.datetime.utcnow() - datetime.timedelta(
             seconds=timeout)

-        amp = session.query(self.model_class).with_for_update().filter_by(
-            busy=False).filter(
-            self.model_class.last_update < expired_time).order_by(
-            func.random()).first()
+        # Update any amphora that were previously FAILOVER_STOPPED
+        # but are no longer expired.
+        self.update_failover_stopped(lock_session, expired_time)

-        if amp is None:
+        # Handle expired amphora
+        expired_ids_query = select(self.model_class.amphora_id).where(
+            self.model_class.busy == false()).where(
+            self.model_class.last_update < expired_time)
+
+        expired_count = lock_session.scalar(
+            select(func.count()).select_from(expired_ids_query))
+
+        threshold = CONF.health_manager.failover_threshold
+        if threshold is not None and expired_count >= threshold:
+            LOG.error('Stale amphora count reached the threshold '
+                      '(%(th)s). %(count)s amphorae were set into '
+                      'FAILOVER_STOPPED status.',
+                      {'th': threshold, 'count': expired_count})
+            lock_session.execute(
+                update(
+                    models.Amphora
+                ).where(
+                    models.Amphora.status.notin_(
+                        [consts.DELETED, consts.PENDING_DELETE])
+                ).where(
+                    models.Amphora.id.in_(expired_ids_query)
+                ).values(
+                    status=consts.AMPHORA_FAILOVER_STOPPED
+                ).execution_options(synchronize_session="fetch"))
             return None

-        amp.busy = True
+        # We don't want to attempt to failover amphora that are not
+        # currently in the ALLOCATED or FAILOVER_STOPPED state.
+        # i.e. Not DELETED, PENDING_*, etc.
+        allocated_amp_ids_subquery = (
+            select(models.Amphora.id).where(
+                models.Amphora.status.in_(
+                    [consts.AMPHORA_ALLOCATED,
+                     consts.AMPHORA_FAILOVER_STOPPED])))

-        return amp.to_data_model()
+        # Pick one expired amphora for automatic failover
+        amp_health = lock_session.query(
+            self.model_class
+        ).with_for_update(
+        ).filter(
+            self.model_class.amphora_id.in_(expired_ids_query)
+        ).filter(
+            self.model_class.amphora_id.in_(allocated_amp_ids_subquery)
+        ).order_by(
+            func.random()
+        ).limit(1).first()
+
+        if amp_health is None:
+            return None
+
+        amp_health.busy = True
+
+        return amp_health.to_data_model()
+
+    def update_failover_stopped(self, lock_session: Session,
+                                expired_time: datetime) -> None:
+        """Updates the status of amps that are FAILOVER_STOPPED."""
+        # Update any FAILOVER_STOPPED amphora that are no longer stale
+        # back to ALLOCATED.
+        # Note: This uses sqlalchemy 2.0 syntax
+        not_expired_ids_subquery = (
+            select(self.model_class.amphora_id).where(
+                self.model_class.busy == false()
+            ).where(
+                self.model_class.last_update >= expired_time
+            ))
+
+        # Note: mysql and sqlite do not support RETURNING, so we cannot
+        # get back the affected amphora IDs. (09/2022)
+        lock_session.execute(
+            update(models.Amphora).where(
+                models.Amphora.status == consts.AMPHORA_FAILOVER_STOPPED
+            ).where(
+                models.Amphora.id.in_(not_expired_ids_subquery)
+            ).values(
+                status=consts.AMPHORA_ALLOCATED
+            ).execution_options(synchronize_session="fetch"))


 class VRRPGroupRepository(BaseRepository):
@@ -4144,12 +4144,31 @@ class AmphoraRepositoryTest(BaseRepositoryTest):
 class AmphoraHealthRepositoryTest(BaseRepositoryTest):
     def setUp(self):
         super().setUp()
+        self._fake_ip_gen = (self.FAKE_IP + str(ip_end) for ip_end in
+                             range(100))
         self.amphora = self.amphora_repo.create(self.session,
                                                 id=self.FAKE_UUID_1,
                                                 compute_id=self.FAKE_UUID_3,
                                                 status=constants.ACTIVE,
                                                 lb_network_ip=self.FAKE_IP)

+    def create_amphora(self, amphora_id, **overrides):
+        fake_ip = next(self._fake_ip_gen)
+        settings = {
+            'id': amphora_id,
+            'compute_id': uuidutils.generate_uuid(),
+            'status': constants.ACTIVE,
+            'lb_network_ip': fake_ip,
+            'vrrp_ip': fake_ip,
+            'ha_ip': fake_ip,
+            'role': constants.ROLE_MASTER,
+            'cert_expiration': datetime.datetime.utcnow(),
+            'cert_busy': False
+        }
+        settings.update(overrides)
+        amphora = self.amphora_repo.create(self.session, **settings)
+        return amphora
+
     def create_amphora_health(self, amphora_id):
         newdate = datetime.datetime.utcnow() - datetime.timedelta(minutes=10)

@@ -4216,10 +4235,94 @@ class AmphoraHealthRepositoryTest(BaseRepositoryTest):
             self.session)
         self.assertIsNone(stale_amphora)

-        self.create_amphora_health(self.amphora.id)
+        uuid = uuidutils.generate_uuid()
+        self.create_amphora(uuid)
+        self.amphora_repo.update(self.session, uuid,
+                                 status=constants.AMPHORA_ALLOCATED)
+        self.create_amphora_health(uuid)
         stale_amphora = self.amphora_health_repo.get_stale_amphora(
             self.session)
-        self.assertEqual(self.amphora.id, stale_amphora.amphora_id)
+        self.assertEqual(uuid, stale_amphora.amphora_id)
+
+    def test_get_stale_amphora_past_threshold(self):
+        conf = self.useFixture(oslo_fixture.Config(cfg.CONF))
+        conf.config(group='health_manager', failover_threshold=3)
+
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+
+        # Two stale amphora expected, should return that amp
+        # These will go into failover and be marked "busy"
+        uuids = []
+        for _ in range(2):
+            uuid = uuidutils.generate_uuid()
+            uuids.append(uuid)
+            self.create_amphora(uuid)
+            self.amphora_repo.update(self.session, uuid,
+                                     status=constants.AMPHORA_ALLOCATED)
+            self.create_amphora_health(uuid)
+            stale_amphora = self.amphora_health_repo.get_stale_amphora(
+                self.session)
+            self.assertIn(stale_amphora.amphora_id, uuids)
+
+        # Creating more stale amphorae should return no amps (past threshold)
+        stale_uuids = []
+        for _ in range(4):
+            uuid = uuidutils.generate_uuid()
+            stale_uuids.append(uuid)
+            self.create_amphora(uuid)
+            self.amphora_repo.update(self.session, uuid,
+                                     status=constants.AMPHORA_ALLOCATED)
+            self.create_amphora_health(uuid)
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+        num_fo_stopped = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
+        ).count()
+        # Note that the two amphora started failover, so are "busy" and
+        # should not be marked FAILOVER_STOPPED.
+        self.assertEqual(4, num_fo_stopped)
+
+        # One recovered, but still over threshold
+        # Two "busy", One fully healthy, three in FAILOVER_STOPPED
+        amp = self.session.query(db_models.AmphoraHealth).filter_by(
+            amphora_id=stale_uuids[2]).first()
+        amp.last_update = datetime.datetime.utcnow()
+        self.session.flush()
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+        num_fo_stopped = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
+        ).count()
+        self.assertEqual(3, num_fo_stopped)
+
+        # Another one recovered, now below threshold
+        # Two are "busy", Two are fully healthy, Two are in FAILOVER_STOPPED
+        amp = self.session.query(db_models.AmphoraHealth).filter_by(
+            amphora_id=stale_uuids[3]).first()
+        amp.last_update = datetime.datetime.utcnow()
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNotNone(stale_amphora)
+        num_fo_stopped = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
+        ).count()
+        self.assertEqual(2, num_fo_stopped)
+
+        # After error recovery all amps should be allocated again
+        now = datetime.datetime.utcnow()
+        for amp in self.session.query(db_models.AmphoraHealth).all():
+            amp.last_update = now
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+        num_allocated = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_ALLOCATED
+        ).count()
+        self.assertEqual(5, num_allocated)

     def test_create(self):
         amphora_health = self.create_amphora_health(self.FAKE_UUID_1)

12
releasenotes/notes/failover-threshold-f5cdf2bbe8a64d6d.yaml
Normal file
@@ -0,0 +1,12 @@
---
features:
  - |
    A new configuration option ``failover_threshold`` can be set to limit the
    number of amphorae simultaneously pending failover before halting the
    automatic failover process. This should help prevent unwanted mass failover
    events that can happen in cases like network interruption to an AZ or the
    database becoming read-only. This feature is not enabled by default, and it
    should be configured carefully based on the size of the environment.
    For example, with 100 amphorae a good threshold might be 20 or 30, or
    a value greater than the typical number of amphorae that would be expected
    on a single host.
@@ -46,7 +46,7 @@ castellan>=0.16.0 # Apache-2.0
 tenacity>=5.0.4 # Apache-2.0
 distro>=1.2.0 # Apache-2.0
 jsonschema>=3.2.0 # MIT
-octavia-lib>=2.5.0 # Apache-2.0
+octavia-lib>=3.1.0 # Apache-2.0
 simplejson>=3.13.2 # MIT
 setproctitle>=1.1.10 # BSD
 python-dateutil>=2.7.0 # BSD