neutron/doc/source/devref/retries.rst
Kevin Benton 09c87425fa Prepare retry decorator to move to plugin level
Retrying mutating operations at the API layer caused a
couple of problems. First, when components would call
the core plugin using the neutron manager, they would
have to handle the retriable failures on their own or
undo any work they had done so far and allow retriable
failures to be propagated up to the API. Second, retrying
at the API makes composite operations (e.g. auto allocate,
add_router_interface, etc) painful because they have to
consider if exceptions are retriable before raising
fatal exceptions on failures of core plugin calls.

This patch begins the process of moving them down to the
core operations with a new decorator called
'retry_if_session_inactive', which ensures that the
retry logic isn't triggered if there is an ongoing transaction
since retrying inside of a transaction is normally ineffective.
Follow-up patches apply them to various parts of the code-base.

Additionally, the args and kwargs of the method are automatically
deep copied in retries to ensure that any mangling the methods
do to their arguments don't impact their retriability.

Finally, since we are leaving the API decorators in place for now,
the retry logic will not be triggered by another decorator if an
exception has already been retried. This prevents an exponential
explosion of retries on nested retry decorators.

The ultimate goal will be to get rid of the API decorators entirely
so retries are up to each individual plugin.

Partial-Bug: #1596075
Partial-Bug: #1612798
Change-Id: I7b8a4a105aabfa1b5f5dd7a638099007b0933e66
2016-09-08 14:07:08 -07:00

6.7 KiB

Retrying Operations

Inside of the neutron.db.api module there is a decorator called 'retry_if_session_inactive'. This should be used to protect any functions that perform DB operations. This decorator will capture any deadlock errors, RetryRequests, connection errors, and unique constraint violations that are thrown by the function it is protecting.

This decorator will not retry an operation if the function it is applied to is called within an active session. This is because the majority of the exceptions it captures put the session into a partially rolled back state so it is no longer usable. It is important to ensure there is a decorator outside of the start of the transaction. The decorators are safe to nest if a function is sometimes called inside of another transaction.

If a function is being protected that does not take context as an argument the 'retry_db_errors' decorator function may be used instead. It retries the same exceptions and has the same anti-nesting behavior as 'retry_if_session_active', but it does not check if a session is attached to any context keywords. ('retry_if_session_active' just uses 'retry_db_errors' internally after checking the session)

Idempotency on Failures

The function that is being decorated should always fully cleanup whenever it encounters an exception so its safe to retry the operation. So if a function creates a DB object, commits, then creates another, the function must have a cleanup handler to remove the first DB object in the case that the second one fails. Assume any DB operation can throw a retriable error.

You may see some retry decorators at the API layers in Neutron; however, we are trying to eliminate them because each API operation has many independent steps that makes ensuring idempotency on partial failures very difficult.

Argument Mutation

A decorated function should not mutate any complex arguments which are passed into it. If it does, it should have an exception handler that reverts the change so it's safe to retry.

The decorator will automatically create deep copies of sets, lists, and dicts which are passed through it, but it will leave the other arguments alone.

Retrying to Handle Race Conditions

One of the difficulties with detecting race conditions to create a DB record with a unique constraint is determining where to put the exception handler because a constraint violation can happen immediately on flush or it may not happen all of the way until the transaction is being committed on the exit of the session context manager. So we would end up with code that looks something like this:

def create_port(context, ip_address, mac_address):
    _ensure_mac_not_in_use(context, mac_address)
    _ensure_ip_not_in_use(context, ip_address)
    try:
        with context.session.begin():
           port_obj = Port(ip=ip_address, mac=mac_address)
           do_expensive_thing(...)
           do_extra_other_thing(...)
           return port_obj
    except DBDuplicateEntry as e:
        # code to parse columns
        if 'mac' in e.columns:
            raise MacInUse(mac_address)
        if 'ip' in e.columns:
            raise IPAddressInUse(ip)

def _ensure_mac_not_in_use(context, mac):
    if context.session.query(Port).filter_by(mac=mac).count():
        raise MacInUse(mac)

def _ensure_ip_not_in_use(context, ip):
    if context.session.query(Port).filter_by(ip=ip).count():
        raise IPAddressInUse(ip)

So we end up with an exception handler that has to understand where things went wrong and convert them into appropriate exceptions for the end-users. This distracts significantly from the main purpose of create_port.

Since the retry decorator will automatically catch and retry DB duplicate errors for us, we can allow it to retry on this race condition which will give the original validation logic to be re-executed and raise the appropriate error. This keeps validation logic in one place and makes the code cleaner.

from neutron.db import api as db_api

@db_api.retry_if_session_inactive()
def create_port(context, ip_address, mac_address):
    _ensure_mac_not_in_use(context, mac_address)
    _ensure_ip_not_in_use(context, ip_address)
    with context.session.begin():
       port_obj = Port(ip=ip_address, mac=mac_address)
       do_expensive_thing(...)
       do_extra_other_thing(...)
       return port_obj

def _ensure_mac_not_in_use(context, mac):
    if context.session.query(Port).filter_by(mac=mac).count():
        raise MacInUse(mac)

def _ensure_ip_not_in_use(context, ip):
    if context.session.query(Port).filter_by(ip=ip).count():
        raise IPAddressInUse(ip)

Nesting

Once the decorator retries an operation the maximum number of times, it will attach a flag to the exception it raises further up that will prevent decorators around the calling functions from retrying the error again. This prevents an exponential increase in the number of retries if they are layered.

Usage

Here are some usage examples:

from neutron.db import api as db_api

@db_api.retry_if_session_inactive()
def create_elephant(context, elephant_details):
    ....

@db_api.retry_if_session_inactive()
def atomic_bulk_create_elephants(context, elephants):
    with context.session.begin():
        for elephant in elephants:
            # note that if create_elephant throws a retriable
            # exception, the decorator around it will not retry
            # because the session is active. The decorator around
            # atomic_bulk_create_elephants will be responsible for
            # retrying the entire operation.
            create_elephant(context, elephant)

# sample usage when session is attached to a var other than 'context'
@db_api.retry_if_session_inactive(context_var_name='ctx')
def some_function(ctx):
    ...