Merge "Failover stop threshold / circuit breaker"
commit 93a2b0a0c3

doc/source/admin/failover-circuit-breaker.rst (new file, 131 lines)
@@ -0,0 +1,131 @@
..
    Copyright Red Hat

    Licensed under the Apache License, Version 2.0 (the "License"); you may
    not use this file except in compliance with the License. You may obtain
    a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    License for the specific language governing permissions and limitations
    under the License.

========================================
Octavia Amphora Failover Circuit Breaker
========================================

During a large infrastructure outage, the automatic failover of stale
amphorae can lead to a mass failover event and create a considerable
amount of extra load on servers. By using the amphora failover
circuit breaker feature, you can avoid these unwanted failover events.
The circuit breaker is a configurable threshold: when the number of stale
amphorae reaches the threshold, Octavia stops failing amphorae over
automatically. The circuit breaker feature is disabled by default.

Configuration
=============

You define the threshold value for the failover circuit breaker feature
by setting the *failover_threshold* option. The *failover_threshold*
option is a member of the *health_manager* group within the
configuration file ``/etc/octavia/octavia.conf``.
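
As a minimal sketch of what such a setting could look like, the snippet
below parses a sample ``[health_manager]`` section with Python's standard
``configparser`` module. Octavia itself reads the option through
oslo.config, not ``configparser``; the sample text, the value ``30``, and
the helper name ``read_failover_threshold`` are assumptions made for
illustration only.

```python
import configparser
from typing import Optional

# Illustrative excerpt of /etc/octavia/octavia.conf. The value 30 is an
# assumed example, not a recommended default.
SAMPLE_CONF = """
[health_manager]
heartbeat_timeout = 60
failover_threshold = 30
"""


def read_failover_threshold(conf_text: str) -> Optional[int]:
    """Return failover_threshold as an int, or None when it is unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("health_manager", "failover_threshold", fallback=None)
    # An unset (or empty) option leaves the circuit breaker disabled.
    return int(raw) if raw else None


print(read_failover_threshold(SAMPLE_CONF))
```

Leaving the option out, or leaving it commented out, corresponds to the
disabled default described above.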

Whenever the number of stale amphorae reaches or surpasses the value
of *failover_threshold*, Octavia performs the following actions:

* Stops automatic failovers of amphorae.
* Sets the status of the stale amphorae to *FAILOVER_STOPPED*.
* Logs an error message.
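
The three actions above can be sketched in plain Python. This is a
simplified model, not the Octavia implementation: the ``Amphora``
dataclass and the ``pick_stale_amphora`` helper are hypothetical, the
real code works on database rows, and it applies extra status filters
that are left out here.

```python
import datetime
from dataclasses import dataclass
from typing import List, Optional

FAILOVER_STOPPED = "FAILOVER_STOPPED"


@dataclass
class Amphora:
    """Hypothetical in-memory stand-in for an amphora and its health row."""
    amphora_id: str
    status: str
    last_update: datetime.datetime
    busy: bool = False


def pick_stale_amphora(amphorae: List[Amphora], now: datetime.datetime,
                       heartbeat_timeout: int,
                       failover_threshold: Optional[int]) -> Optional[Amphora]:
    """Return one amphora to fail over, or None if the breaker trips."""
    expired_time = now - datetime.timedelta(seconds=heartbeat_timeout)
    # Stale means: not already being failed over and no recent heartbeat.
    stale = [a for a in amphorae
             if not a.busy and a.last_update < expired_time]
    if failover_threshold is not None and len(stale) >= failover_threshold:
        # Circuit breaker: stop failovers and flag every stale amphora.
        for amp in stale:
            amp.status = FAILOVER_STOPPED
        print(f"ERROR stale amphora count reached the threshold "
              f"({failover_threshold}); {len(stale)} set to {FAILOVER_STOPPED}")
        return None
    if not stale:
        return None
    stale[0].busy = True  # claimed for automatic failover
    return stale[0]
```

With a threshold of ``None`` the breaker never trips, which matches the
disabled-by-default behavior.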

The line below shows a typical error message:

.. code-block:: bash

    ERROR octavia.db.repositories [-] Stale amphora count reached the threshold (3). 4 amphorae were set into FAILOVER_STOPPED status.

.. note:: Base the value that you set for *failover_threshold* on the
    size of your environment. We recommend that you set the value to a number
    greater than the typical number of amphorae that you estimate to run on a
    single host, or to a value between 20% and 30%
    of the total number of amphorae.
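
The sizing guidance can be turned into a small calculation. The helper
below is only a sketch: the function name is hypothetical, and combining
the two rules of thumb with ``max`` is an assumption of this example, not
something the feature prescribes.

```python
from typing import Tuple


def suggested_threshold_range(total_amphorae: int,
                              typical_per_host: int) -> Tuple[int, int]:
    """Return (low, high) candidate thresholds from the two rules of thumb."""
    # Rule 1: strictly more than the amphorae expected on one host, so that
    # losing a single host does not trip the breaker.
    per_host_floor = typical_per_host + 1
    # Rule 2: between 20% and 30% of all amphorae (integer arithmetic,
    # rounding the lower bound up and the upper bound down).
    twenty_percent = (total_amphorae * 20 + 99) // 100
    thirty_percent = (total_amphorae * 30) // 100
    low = max(per_host_floor, twenty_percent)
    high = max(low, thirty_percent)
    return low, high


print(suggested_threshold_range(100, 12))
```

For 100 amphorae and about 12 amphorae per host, the sketch yields a range
of 20 to 30.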

Error Recovery
==============

Automatic Error Recovery
------------------------

For amphorae whose status is *FAILOVER_STOPPED*, Octavia
automatically resets their status to *ALLOCATED* after receiving
new heartbeat updates from these amphorae.
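
The automatic reset can be pictured with the following sketch. It uses
plain dictionaries instead of database tables, and the helper name
``reset_recovered`` is hypothetical; the rule it models is that an amphora
in *FAILOVER_STOPPED* whose last update is no older than the heartbeat
timeout goes back to *ALLOCATED*.

```python
import datetime
from typing import Dict

ALLOCATED = "ALLOCATED"
FAILOVER_STOPPED = "FAILOVER_STOPPED"


def reset_recovered(statuses: Dict[str, str],
                    last_updates: Dict[str, datetime.datetime],
                    expired_time: datetime.datetime) -> int:
    """Flip FAILOVER_STOPPED amphorae with fresh heartbeats to ALLOCATED.

    Returns the number of amphorae that were reset.
    """
    reset = 0
    for amphora_id, status in statuses.items():
        fresh = last_updates[amphora_id] >= expired_time
        if status == FAILOVER_STOPPED and fresh:
            statuses[amphora_id] = ALLOCATED
            reset += 1
    return reset
```

Amphorae in any other status are left untouched, even when their
heartbeats are fresh.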

Manual Error Recovery
---------------------

To recover manually from the *FAILOVER_STOPPED* condition, you must
reduce the number of stale amphorae below the
circuit breaker threshold.

You can use the ``openstack loadbalancer amphora list`` command
to list the amphorae that are in *FAILOVER_STOPPED* state.
Use the ``openstack loadbalancer amphora failover`` command to
manually fail over an amphora.
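
If you prefer to script this step, one possible approach is to save the
listing as JSON and filter it. The sketch below assumes the listing was
produced with the client's generic ``-f json`` output formatter and that
the JSON keys match the column names ``id`` and ``status``; the sample
data and the helper name ``failover_stopped_ids`` are illustrative. It only
prints the failover commands so that you can review them before running
anything.

```python
import json
from typing import List

# Assumed sample of `openstack loadbalancer amphora list -f json` output,
# trimmed to the two fields the helper needs.
SAMPLE_JSON = """
[
  {"id": "79f0e06d-446d-448a-9d2b-c3b89d0c700d", "status": "FAILOVER_STOPPED"},
  {"id": "9c0416d7-6293-4f13-8f67-61e5d757b36e", "status": "ALLOCATED"},
  {"id": "e11208b7-f13d-4db3-9ded-1ee6f70a0502", "status": "FAILOVER_STOPPED"}
]
"""


def failover_stopped_ids(listing_json: str) -> List[str]:
    """Return the IDs of all amphorae whose status is FAILOVER_STOPPED."""
    rows = json.loads(listing_json)
    return [row["id"] for row in rows
            if row.get("status") == "FAILOVER_STOPPED"]


for amphora_id in failover_stopped_ids(SAMPLE_JSON):
    # Print the commands for review instead of executing them.
    print(f"openstack loadbalancer amphora failover --wait {amphora_id}")
```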

In this example, *failover_threshold = 3* and an infrastructure
outage caused four amphorae to become unavailable. After the
health manager process detects this state, it sets the status
of all stale amphorae to *FAILOVER_STOPPED* as shown below.

.. code-block:: bash

    openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | id                                   | loadbalancer_id                      | status           | role   | lb_network_ip | ha_ip      |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | 79f0e06d-446d-448a-9d2b-c3b89d0c700d | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | BACKUP | 192.168.0.108 | 192.0.2.17 |
    | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED        | MASTER | 192.168.0.198 | 192.0.2.42 |
    | e11208b7-f13d-4db3-9ded-1ee6f70a0502 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | MASTER | 192.168.0.154 | 192.0.2.17 |
    | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED        | MASTER | 192.168.0.149 | 192.0.2.26 |
    | a1351933-2270-493c-8201-d8f9f9fe42f7 | 4b13dda1-296a-400c-8248-1abad5728057 | FAILOVER_STOPPED | BACKUP | 192.168.0.103 | 192.0.2.42 |
    | 441718e7-0956-436b-9f99-9a476339d7d2 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | FAILOVER_STOPPED | BACKUP | 192.168.0.148 | 192.0.2.26 |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+

After operators have resolved the infrastructure outage,
they might need to manually trigger failovers to return to
normal operation. In this example, two manual failovers are
necessary to get the number of stale amphorae below the
configured threshold of three:

.. code-block:: bash

    openstack loadbalancer amphora failover --wait 79f0e06d-446d-448a-9d2b-c3b89d0c700d
    openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | id                                   | loadbalancer_id                      | status           | role   | lb_network_ip | ha_ip      |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED        | MASTER | 192.168.0.198 | 192.0.2.42 |
    | e11208b7-f13d-4db3-9ded-1ee6f70a0502 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | FAILOVER_STOPPED | MASTER | 192.168.0.154 | 192.0.2.17 |
    | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED        | MASTER | 192.168.0.149 | 192.0.2.26 |
    | a1351933-2270-493c-8201-d8f9f9fe42f7 | 4b13dda1-296a-400c-8248-1abad5728057 | FAILOVER_STOPPED | BACKUP | 192.168.0.103 | 192.0.2.42 |
    | 441718e7-0956-436b-9f99-9a476339d7d2 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | FAILOVER_STOPPED | BACKUP | 192.168.0.148 | 192.0.2.26 |
    | cf734b57-6019-4ec0-8437-115f76d1bbb0 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED        | BACKUP | 192.168.0.141 | 192.0.2.17 |
    +--------------------------------------+--------------------------------------+------------------+--------+---------------+------------+
    openstack loadbalancer amphora failover --wait e11208b7-f13d-4db3-9ded-1ee6f70a0502
    openstack loadbalancer amphora list
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
    | id                                   | loadbalancer_id                      | status    | role   | lb_network_ip | ha_ip      |
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+
    | 9c0416d7-6293-4f13-8f67-61e5d757b36e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED | MASTER | 192.168.0.198 | 192.0.2.42 |
    | ceea9fff-71a2-48c8-a968-e51dc440c572 | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED | MASTER | 192.168.0.149 | 192.0.2.26 |
    | cf734b57-6019-4ec0-8437-115f76d1bbb0 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED | BACKUP | 192.168.0.141 | 192.0.2.17 |
    | d2909051-402e-4e75-86c9-ec6725c814a1 | 8fd2cac5-cbca-4bb1-bcfc-daba43e097ab | ALLOCATED | MASTER | 192.168.0.25  | 192.0.2.17 |
    | 5133e01a-fb53-457b-b810-edbb5202437e | 4b13dda1-296a-400c-8248-1abad5728057 | ALLOCATED | BACKUP | 192.168.0.76  | 192.0.2.42 |
    | f82eff89-e326-4e9d-86bc-58c720220a3f | ab513cb3-8f5d-461e-b7ae-a06b5083a371 | ALLOCATED | BACKUP | 192.168.0.86  | 192.0.2.26 |
    +--------------------------------------+--------------------------------------+-----------+--------+---------------+------------+

After the number of stale amphorae falls below the configured
threshold value, normal operation resumes and the automatic
failover process attempts to restore the remaining stale amphorae.

@@ -36,6 +36,7 @@ Optional Installation and Configuration Guides
     healthcheck.rst
     flavors.rst
     apache-httpd.rst
+    failover-circuit-breaker.rst
 
 Maintanence and Operations
 --------------------------
@@ -128,6 +128,11 @@
 # heartbeat_timeout = 60
 # health_check_interval = 3
 # sock_rlimit = 0
+# Stop failovers if the count of simultaneously failed
+# amphora reaches this number (circuit breaker). This may prevent large
+# scale accidental failover events, like in the case of
+# network failures or read-only database issues.
+# failover_threshold =
 
 [keystone_authtoken]
 # This group of config options are imported from keystone middleware. Thus the
@@ -304,6 +304,11 @@ health_manager_opts = [
                help=_('Sleep time between health checks in seconds.')),
     cfg.IntOpt('sock_rlimit', default=0,
                help=_(' sets the value of the heartbeat recv buffer')),
+    cfg.IntOpt('failover_threshold', default=None,
+               help=_('Stop failovers if the count of simultaneously failed '
+                      'amphora reaches this number. This may prevent large '
+                      'scale accidental failover events, like in the case of '
+                      'network failures or read-only database issues.')),
 
     # Used by the health manager on the amphora
     cfg.ListOpt('controller_ip_port_list',
@@ -161,6 +161,8 @@ AMPHORA_ALLOCATED = lib_consts.AMPHORA_ALLOCATED
 AMPHORA_BOOTING = lib_consts.AMPHORA_BOOTING
 # Amphora is ready to be allocated to a load balancer 'READY'
 AMPHORA_READY = lib_consts.AMPHORA_READY
+# 'FAILOVER_STOPPED'. Failover threshold level has been reached.
+AMPHORA_FAILOVER_STOPPED = lib_consts.AMPHORA_FAILOVER_STOPPED
 # 'ACTIVE'
 ACTIVE = lib_consts.ACTIVE
 # 'PENDING_DELETE'
@@ -248,13 +250,6 @@ MUTABLE_STATUSES = (lib_consts.ACTIVE,)
 DELETABLE_STATUSES = (lib_consts.ACTIVE, lib_consts.ERROR)
 FAILOVERABLE_STATUSES = (lib_consts.ACTIVE, lib_consts.ERROR)
 
-# Note: The database Amphora table has a foreign key constraint against
-# the provisioning_status table
-SUPPORTED_AMPHORA_STATUSES = (
-    lib_consts.AMPHORA_ALLOCATED, lib_consts.AMPHORA_BOOTING, lib_consts.ERROR,
-    lib_consts.AMPHORA_READY, lib_consts.DELETED, lib_consts.PENDING_CREATE,
-    lib_consts.PENDING_DELETE)
-
 AMPHORA_VM = 'VM'
 SUPPORTED_AMPHORA_TYPES = (AMPHORA_VM,)
 
@@ -92,7 +92,6 @@ class HealthManager(object):
             lock_session = None
             try:
                 lock_session = db_api.get_session(autocommit=False)
-                amp = None
                 amp_health = self.amp_health_repo.get_stale_amphora(
                     lock_session)
                 if amp_health:
@@ -0,0 +1,41 @@
# Copyright Red Hat
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
"""Add FAILOVER_STOPPED to provisioning_status table

Revision ID: 0995c26fc506
Revises: 31f7653ded67
Create Date: 2022-03-24 04:53:10.768658

"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = '0995c26fc506'
down_revision = '31f7653ded67'


def upgrade():
    insert_table = sa.sql.table(
        'provisioning_status',
        sa.sql.column('name', sa.String),
        sa.sql.column('description', sa.String)
    )

    op.bulk_insert(
        insert_table,
        [
            {'name': 'FAILOVER_STOPPED'},
        ]
    )
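
Why the new status needs a migration can be illustrated with a small,
self-contained sketch. It relies on the foreign key from the amphora table
to the ``provisioning_status`` table that the removed comment in the
constants hunk mentions. The two-column schemas below are simplified
assumptions, and SQLite stands in for the real database purely for
demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when asked to.
conn.execute("PRAGMA foreign_keys = ON")
# Simplified, hypothetical schemas; the real tables have more columns.
conn.execute("CREATE TABLE provisioning_status ("
             "name TEXT PRIMARY KEY, description TEXT)")
conn.execute("CREATE TABLE amphora ("
             "id TEXT PRIMARY KEY, "
             "status TEXT NOT NULL REFERENCES provisioning_status(name))")
conn.execute("INSERT INTO provisioning_status (name) VALUES ('ALLOCATED')")
conn.execute("INSERT INTO amphora VALUES ('amp-1', 'ALLOCATED')")


def set_status(status: str) -> bool:
    """Try to set the amphora status; return False on a constraint error."""
    try:
        conn.execute("UPDATE amphora SET status = ? WHERE id = 'amp-1'",
                     (status,))
        return True
    except sqlite3.IntegrityError:
        return False


# Before the status row exists, the update is rejected.
before = set_status("FAILOVER_STOPPED")
# The equivalent of the bulk insert performed by the migration.
conn.execute(
    "INSERT INTO provisioning_status (name) VALUES ('FAILOVER_STOPPED')")
after = set_status("FAILOVER_STOPPED")
print(before, after)
```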
@@ -19,6 +19,7 @@ reference
 """
 
 import datetime
+from typing import Optional
 
 from oslo_config import cfg
 from oslo_db import api as oslo_db_api
@@ -28,9 +29,12 @@ from oslo_serialization import jsonutils
 from oslo_utils import excutils
 from oslo_utils import uuidutils
 from sqlalchemy.orm import noload
+from sqlalchemy.orm import Session
 from sqlalchemy.orm import subqueryload
+from sqlalchemy import select
 from sqlalchemy.sql.expression import false
 from sqlalchemy.sql import func
+from sqlalchemy import update
 
 from octavia.common import constants as consts
 from octavia.common import data_models
@@ -1678,28 +1682,99 @@ class AmphoraHealthRepository(BaseRepository):
         # In this case, the amphora is expired.
         return amphora_model is None
 
-    def get_stale_amphora(self, session):
+    def get_stale_amphora(self,
+                          lock_session: Session) -> Optional[models.Amphora]:
         """Retrieves a stale amphora from the health manager database.
 
-        :param session: A Sql Alchemy database session.
+        :param lock_session: A Sql Alchemy database autocommit session.
         :returns: [octavia.common.data_model]
         """
 
         timeout = CONF.health_manager.heartbeat_timeout
         expired_time = datetime.datetime.utcnow() - datetime.timedelta(
             seconds=timeout)
 
-        amp = session.query(self.model_class).with_for_update().filter_by(
-            busy=False).filter(
-            self.model_class.last_update < expired_time).order_by(
-            func.random()).first()
+        # Update any amphora that were previously FAILOVER_STOPPED
+        # but are no longer expired.
+        self.update_failover_stopped(lock_session, expired_time)
 
-        if amp is None:
+        # Handle expired amphora
+        expired_ids_query = select(self.model_class.amphora_id).where(
+            self.model_class.busy == false()).where(
+                self.model_class.last_update < expired_time)
+
+        expired_count = lock_session.scalar(
+            select(func.count()).select_from(expired_ids_query))
+
+        threshold = CONF.health_manager.failover_threshold
+        if threshold is not None and expired_count >= threshold:
+            LOG.error('Stale amphora count reached the threshold '
+                      '(%(th)s). %(count)s amphorae were set into '
+                      'FAILOVER_STOPPED status.',
+                      {'th': threshold, 'count': expired_count})
+            lock_session.execute(
+                update(
+                    models.Amphora
+                ).where(
+                    models.Amphora.status.notin_(
+                        [consts.DELETED, consts.PENDING_DELETE])
+                ).where(
+                    models.Amphora.id.in_(expired_ids_query)
+                ).values(
+                    status=consts.AMPHORA_FAILOVER_STOPPED
+                ).execution_options(synchronize_session="fetch"))
             return None
 
-        amp.busy = True
+        # We don't want to attempt to failover amphora that are not
+        # currently in the ALLOCATED or FAILOVER_STOPPED state.
+        # i.e. Not DELETED, PENDING_*, etc.
+        allocated_amp_ids_subquery = (
+            select(models.Amphora.id).where(
+                models.Amphora.status.in_(
+                    [consts.AMPHORA_ALLOCATED,
+                     consts.AMPHORA_FAILOVER_STOPPED])))
 
-        return amp.to_data_model()
+        # Pick one expired amphora for automatic failover
+        amp_health = lock_session.query(
+            self.model_class
+        ).with_for_update(
+        ).filter(
+            self.model_class.amphora_id.in_(expired_ids_query)
+        ).filter(
+            self.model_class.amphora_id.in_(allocated_amp_ids_subquery)
+        ).order_by(
+            func.random()
+        ).limit(1).first()
+
+        if amp_health is None:
+            return None
+
+        amp_health.busy = True
+
+        return amp_health.to_data_model()
+
+    def update_failover_stopped(self, lock_session: Session,
+                                expired_time: datetime) -> None:
+        """Updates the status of amps that are FAILOVER_STOPPED."""
+        # Update any FAILOVER_STOPPED amphora that are no longer stale
+        # back to ALLOCATED.
+        # Note: This uses sqlalchemy 2.0 syntax
+        not_expired_ids_subquery = (
+            select(self.model_class.amphora_id).where(
+                self.model_class.busy == false()
+            ).where(
+                self.model_class.last_update >= expired_time
+            ))
+
+        # Note: mysql and sqlite do not support RETURNING, so we cannot
+        # get back the affected amphora IDs. (09/2022)
+        lock_session.execute(
+            update(models.Amphora).where(
+                models.Amphora.status == consts.AMPHORA_FAILOVER_STOPPED
+            ).where(
+                models.Amphora.id.in_(not_expired_ids_subquery)
+            ).values(
+                status=consts.AMPHORA_ALLOCATED
+            ).execution_options(synchronize_session="fetch"))
 
 
 class VRRPGroupRepository(BaseRepository):
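
The query pattern in this hunk (count the expired, non-busy health rows,
then update every matching amphora through an ``IN`` subquery) can be tried
in isolation. The sketch below uses the standard ``sqlite3`` module with raw
SQL rather than SQLAlchemy; the two-table schema, the integer timestamps,
and the ``trip_breaker`` helper are simplifying assumptions, not the
project's actual models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE amphora (id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE amphora_health (amphora_id TEXT PRIMARY KEY,
                             last_update INTEGER, busy INTEGER);
""")
# Four stale amphorae (last_update=0) and one healthy (last_update=100).
for i, last_update in enumerate([0, 0, 0, 0, 100]):
    conn.execute("INSERT INTO amphora VALUES (?, 'ALLOCATED')", (f"amp-{i}",))
    conn.execute("INSERT INTO amphora_health VALUES (?, ?, 0)",
                 (f"amp-{i}", last_update))

# Reused both for counting and as the IN (...) subquery of the UPDATE.
EXPIRED_IDS = ("SELECT amphora_id FROM amphora_health "
               "WHERE busy = 0 AND last_update < :expired")


def trip_breaker(expired, threshold):
    """Mark stale amphorae FAILOVER_STOPPED when the count hits threshold."""
    params = {"expired": expired}
    (count,) = conn.execute(
        f"SELECT count(*) FROM ({EXPIRED_IDS})", params).fetchone()
    if threshold is None or count < threshold:
        return False
    conn.execute(
        "UPDATE amphora SET status = 'FAILOVER_STOPPED' "
        "WHERE status NOT IN ('DELETED', 'PENDING_DELETE') "
        f"AND id IN ({EXPIRED_IDS})", params)
    return True


tripped = trip_breaker(expired=50, threshold=3)
stopped = conn.execute(
    "SELECT count(*) FROM amphora WHERE status = 'FAILOVER_STOPPED'"
).fetchone()[0]
print(tripped, stopped)
```

Doing the update through a subquery keeps the whole operation in a single
statement, so the affected IDs never have to be fetched first.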
@@ -4144,12 +4144,31 @@ class AmphoraRepositoryTest(BaseRepositoryTest):
 class AmphoraHealthRepositoryTest(BaseRepositoryTest):
     def setUp(self):
         super().setUp()
+        self._fake_ip_gen = (self.FAKE_IP + str(ip_end) for ip_end in
+                             range(100))
         self.amphora = self.amphora_repo.create(self.session,
                                                 id=self.FAKE_UUID_1,
                                                 compute_id=self.FAKE_UUID_3,
                                                 status=constants.ACTIVE,
                                                 lb_network_ip=self.FAKE_IP)
 
+    def create_amphora(self, amphora_id, **overrides):
+        fake_ip = next(self._fake_ip_gen)
+        settings = {
+            'id': amphora_id,
+            'compute_id': uuidutils.generate_uuid(),
+            'status': constants.ACTIVE,
+            'lb_network_ip': fake_ip,
+            'vrrp_ip': fake_ip,
+            'ha_ip': fake_ip,
+            'role': constants.ROLE_MASTER,
+            'cert_expiration': datetime.datetime.utcnow(),
+            'cert_busy': False
+        }
+        settings.update(overrides)
+        amphora = self.amphora_repo.create(self.session, **settings)
+        return amphora
+
     def create_amphora_health(self, amphora_id):
         newdate = datetime.datetime.utcnow() - datetime.timedelta(minutes=10)
 
@@ -4216,10 +4235,94 @@ class AmphoraHealthRepositoryTest(BaseRepositoryTest):
             self.session)
         self.assertIsNone(stale_amphora)
 
-        self.create_amphora_health(self.amphora.id)
+        uuid = uuidutils.generate_uuid()
+        self.create_amphora(uuid)
+        self.amphora_repo.update(self.session, uuid,
+                                 status=constants.AMPHORA_ALLOCATED)
+        self.create_amphora_health(uuid)
         stale_amphora = self.amphora_health_repo.get_stale_amphora(
             self.session)
-        self.assertEqual(self.amphora.id, stale_amphora.amphora_id)
+        self.assertEqual(uuid, stale_amphora.amphora_id)
+
+    def test_get_stale_amphora_past_threshold(self):
+        conf = self.useFixture(oslo_fixture.Config(cfg.CONF))
+        conf.config(group='health_manager', failover_threshold=3)
+
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+
+        # Two stale amphora expected, should return that amp
+        # These will go into failover and be marked "busy"
+        uuids = []
+        for _ in range(2):
+            uuid = uuidutils.generate_uuid()
+            uuids.append(uuid)
+            self.create_amphora(uuid)
+            self.amphora_repo.update(self.session, uuid,
+                                     status=constants.AMPHORA_ALLOCATED)
+            self.create_amphora_health(uuid)
+            stale_amphora = self.amphora_health_repo.get_stale_amphora(
+                self.session)
+            self.assertIn(stale_amphora.amphora_id, uuids)
+
+        # Creating more stale amphorae should return no amps (past threshold)
+        stale_uuids = []
+        for _ in range(4):
+            uuid = uuidutils.generate_uuid()
+            stale_uuids.append(uuid)
+            self.create_amphora(uuid)
+            self.amphora_repo.update(self.session, uuid,
+                                     status=constants.AMPHORA_ALLOCATED)
+            self.create_amphora_health(uuid)
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+        num_fo_stopped = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
+        ).count()
+        # Note that the two amphora started failover, so are "busy" and
+        # should not be marked FAILOVER_STOPPED.
+        self.assertEqual(4, num_fo_stopped)
+
+        # One recovered, but still over threshold
+        # Two "busy", One fully healthy, three in FAILOVER_STOPPED
+        amp = self.session.query(db_models.AmphoraHealth).filter_by(
+            amphora_id=stale_uuids[2]).first()
+        amp.last_update = datetime.datetime.utcnow()
+        self.session.flush()
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+        num_fo_stopped = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
+        ).count()
+        self.assertEqual(3, num_fo_stopped)
+
+        # Another one recovered, now below threshold
+        # Two are "busy", Two are fully healthy, Two are in FAILOVER_STOPPED
+        amp = self.session.query(db_models.AmphoraHealth).filter_by(
+            amphora_id=stale_uuids[3]).first()
+        amp.last_update = datetime.datetime.utcnow()
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNotNone(stale_amphora)
+        num_fo_stopped = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_FAILOVER_STOPPED
+        ).count()
+        self.assertEqual(2, num_fo_stopped)
+
+        # After error recovery all amps should be allocated again
+        now = datetime.datetime.utcnow()
+        for amp in self.session.query(db_models.AmphoraHealth).all():
+            amp.last_update = now
+        stale_amphora = self.amphora_health_repo.get_stale_amphora(
+            self.session)
+        self.assertIsNone(stale_amphora)
+        num_allocated = self.session.query(db_models.Amphora).filter(
+            db_models.Amphora.status == constants.AMPHORA_ALLOCATED
+        ).count()
+        self.assertEqual(5, num_allocated)
 
     def test_create(self):
         amphora_health = self.create_amphora_health(self.FAKE_UUID_1)

releasenotes/notes/failover-threshold-f5cdf2bbe8a64d6d.yaml (new file, 12 lines)
@@ -0,0 +1,12 @@
---
features:
  - |
    A new configuration option ``failover_threshold`` can be set to limit the
    number of amphorae simultaneously pending failover before halting the
    automatic failover process. This should help prevent unwanted mass failover
    events that can happen in cases like network interruption to an AZ or the
    database becoming read-only. This feature is not enabled by default, and it
    should be configured carefully based on the size of the environment.
    For example, with 100 amphorae a good threshold might be 20 or 30, or
    a value greater than the typical number of amphorae that would be expected
    on a single host.
@@ -46,7 +46,7 @@ castellan>=0.16.0 # Apache-2.0
 tenacity>=5.0.4 # Apache-2.0
 distro>=1.2.0 # Apache-2.0
 jsonschema>=3.2.0 # MIT
-octavia-lib>=2.5.0 # Apache-2.0
+octavia-lib>=3.1.0 # Apache-2.0
 simplejson>=3.13.2 # MIT
 setproctitle>=1.1.10 # BSD
 python-dateutil>=2.7.0 # BSD