Browse Source

Add spec for push notification refactor

Adds a spec to change the method we use to get
information from the server to the agents. Rather
than the server notifying the agent to call the server,
we can just put the relevant data in the notification
itself to improve scalability and reliability.

The bulk of this spec is dealing with the message ordering
guarantee we will need to accomplish this. It also has
some work items to help improve our current pattern.

Related-Bug: #1516195
Change-Id: I3af200ad84483e6e1fe619d516ff20bc87041f7c
Kevin Benton 5 years ago
committed by Kevin Benton
1 changed files with 335 additions and 0 deletions
  1. +335

+ 335
- 0
specs/newton/push-notifications.rst View File

@@ -0,0 +1,335 @@
This work is licensed under a Creative Commons Attribution 3.0 Unported

Push Notifications for Agents


Launchpad blueprint:

The current method we use to get information from the server to the agent
is driven by notification and error-triggered calls to the server by the agent.
So during normal operation, the server will send out a notification that a
specific object has changed (e.g. a port) and then the agent will respond
to that by querying the server for information about that port. If the
agent encounters a failure while processing changes, it will start over
and re-query the server in the process.

The load on the server from this agent-driven approach can be very
unpredictable depending on the changes to object states on the neutron
server. For example, a single network update will result in a query from
every L2 agent with a port on that network.

This blueprint aims to change the pattern we use to get information to the
agents to primarily be based on pushing the object state out in the change
notifications. For anything not changed to leverage this method of retrieval
(e.g. initial agent startup still needs to poll), the AMQP timeout handling
will be fixed to ensure it has an exponential back-off to prevent the agents
from stampeding the server.

Problem Description

An outage of a few agents and their recovery can lead to all of the agents
drowning the neutron servers with requests. This can cause the neutron servers
to fail to respond in time, which results in more retry requests building up,
leaving the entire system useless until operator intervention.

This is caused by 3 problems:

* We don't make optimal use of server notifications. There are times when
the server will send a notification to an agent to inform it that something
has changed. Then the agent has to make a call back to the server to get the
relevant details. This means a single L3 rescheduling event of a set of
routers due to a failed L3 agent can result in N more calls to the server
where N is the number of routers. Compounding this issue, a single agent
may make multiple calls to the server for a single operation (e.g. the L2
agent will make one call for port info, and then another for security group

* The agents will give up after a short period of time on a request and retry
the request or issue an even more expensive request (e.g. if synchronizing
info for one item fails, a major issue is assumed so a request to sync all
items is issued). So by the time the server finishes fulfilling a request,
the client is no longer waiting for the response so it goes in the
trash. As this compounds, it leaves the server processing a massive queue
of requests that won't even have listeners for the responses.

* Related to the second item is the fact that the agents are aggressive in
their retry mechanisms. If a request times out, that request is immediately
retried with the same timeout value; that is, they have no back-off
mechanism. (This has now been addressed by which adds backoff,
sleep, and jitter.)

Proposed Change

Eliminate expensive cases where calls are made to the neutron server in
response to a notification generated by the server. In most of these cases
where the agent is just asking for regular neutron objects
(e.g. ports, networks), we can leverage the RPC callbacks mechanism
introduced in Liberty[1] to have the server send the entire changed object
as part of the notification so the agent has the information it needs.

The main targets for this will be the security group info call,
the get_device_details call, and the sync_routers call. Others will be
included if the change is trivial once these three are done.
The DHCP agent already relies on push notifications, so it will just
be updated to use the revision number to detect the out of order events
it's susceptible to now.

For the remaining calls that cannot easily be converted into the callbacks
mechanism (e.g. the security groups call which blends several objects,
the initial synchronization mechanism, and agent-generated calls), a nicer
timeout mechanism will be implemented with an exponential back-off and timeout
increase so a heavily loaded server is not continuously hammered to death.

Changes to RPC callback mechanism

The current issue with the RPC callback mechanism and sending objects as
notifications is a lack of server operation ordering guarantees and
AMQP message ordering guarantees.

To illustrate the first issue, examine the following order of events that
happen when two servers update the same port:

* Server 1 commits update to DB
* Server 2 commits update to DB
* Server 2 sends notification
* Server 1 sends notification

If the agent receives the notifications in the order in which they are
delivered to AMQP, it will think the state delivered by Server 1 is the
current state when it is actually the state committed by Server 2.

We also have the same issue when oslo messaging doesn't guarantee
message order (e.g. ZeroMQ). Even if Server 1 sends immediately after
its commit and before Server 2 commits and sends, one or more of the
agents could end up seeing Server 2's message before Server 1's.

To handle this, we will add a revision number, implemented as a monotonic
counter, to each object. This counter will be incremented on any update
so any agent can immediately identify stale messages.

To address deletes arriving before updates, agents will be expected
to keep a set of the UUIDs that have been deleted. Upon receiving an update,
the agent will check this set for the object's UUID and ignore the update
since deletes are permanent and UUIDs cannot be re-used. If we do make IDs
recyclable in the future, this can be replaced with a strategy to confirm
ID existence with the server or we can add another internal UUID that
cannot be specified.

Note that this doesn't guaruntee message ordering for the agent because
that is a property of the message backend, but it does give the agent the
necessary info to re-order messages when it receives them so they can
determine which one reflects the more recent state of the DB.

Data Model Impact

A 'revision_number' column will be added to the standard attr table. This
column will just be a simple big integer used as monotonic counter that
will be updated whenever the object is updated on the neutron server.
This revision number can then be used by the agents to automatically
discard any object states that are older than the state they already have.

This revision_number will use the version counter feature which is built-in
to SQLAlchemy:
Each time an object is updated, the server will perform a compare-and-swap
operation based on the revision number. This ensures that each update must
start with the current revision number or it will fail with a StaleDataError.
The API layer can catch this error with the current DB retry mechanism and
start over with the latest revision number.

While SQLAlchemy will automatically bump the revision for us when the record
for an object is updated (e.g. a standard attr description field), it will
not update it if it's a related object changing (e.g. adding an IP address
to the port or changing its status). So we will have to manually trigger
the revision bump (either via a PRECOMMIT callback or inline code) for
any operations that we want to trigger the revision number bump.

What this guarantees:

- An object in a notification is newer (from a DB state perspective) than
an object with a lower revision number. So any objects with lower revision
numbers can safely be ignored since they represent stale DB state.

What this doesn't guarantee:

- Message ordering 'on the wire'. An AMQP listener may end up receiving an
older state than a message it has already received. It's up to the listener
to look at the revision number to determine if the message is stale.
- That each intermediary state is transmitted. If a notification mechanism
reads the DB to get the full object to send, the DB state may have progressed
so it will notify with the latest state than the state that triggered the
original notification. This is acceptable for all of our use cases since we
only care about the current state of the object to wire up the dataplane. It
is also effectively what we have now since the DB state could change between
when the agent gets a notification and when it actually asks the server for
- Reliability of the notifications themselves. This doesn't address the issue
we currently have where a dropped notification is not detected.

Notifications Impact

Making existing notifications significantly more data-rich. The hope here is
to eliminate many of the expensive RPC calls that each agent makes and have
each agent derive all state from notifications with one sync method for
recovery/initialization that we can focus on optimizing.

This will result in more data being sent up front by the server to the
messaging layer, but it will eliminate the data that would be sent in
response to a call request from the agent in the current pattern. For a
single agent, the only gain is the elimination of the notification and
call messages; but for multiple agents interested in the same resource,
it eliminates extra DB calls and extra messages from the server to fulfill
those calls.

This pattern will result in fewer messages sent to oslo messaging because
of the elimination of the calls from the agents that would result in the
same payload we are preemptively broadcasting once instead of casting
multiple times to each requesting agent.

Performance Impact

Higher ratio of neutron agents per server afforded by a large reduction in
sporadic queries by the agents.

This comes at a cost of effectively serializing operations on an individual
object due to the compare and swap operation on the server. For example,
if two server threads try to update a single object concurrently and both
read the current state of the object at the same time, one will fail on
commit with a StaleDataError which will be retried by the API layer.
Previously both of these would succeed because the UPDATE statement
would have no compare-and-swap WHERE criteria. However, this is a very
reasonable performance cost to pay considering that concurrent updates to
the same API object are not common.

Other Deployer Impact

N/A - upgrade path will maintain normal N-1 backward compatibility on the
server so all of the current RPC endpoints will be left untouched for one

Developer Impact

Need to change development guidelines to avoid the implementation of new
direct server calls.

The notifications will have to send out oslo versioned objects since
notifications don't have RPC versions. So at a minimum we need to
switch to oslo versioned objects in the notification code if we
can't get them fully implemented everywhere else. To do this we
can leverage the RPC callbacks mechanism.


Maintain the current information retrieval pattern and just adjust the timeout
mechanism for everything to include back-offs or use cast/cast instead of
calls. This will allow a system to automatically recover from self-induced
death by stampede, but it will not make the performance any more predictable.



Primary assignee:
Ihar Hrachyshka

Work Items

* Exponential back-off for timeouts on agents
* Implement 'revision' extension to add the revision_number column to the
data-model and expose it as a standard attribute.
* Write tests to ensure revisions are incremented as expected
* Write (at least one) test that verifies a StaleDataError is triggered
in the event of concurrent updates.
* Update DHCP agent to make use of this new 'revision' field to discard stale
updates. This will be used as the proof of concept for this approach since
the DHCP agent is currently exposed to operating on stale data with out of
order messages.
* Replace the use of sync_routers calls on the L3 agents for the most frequent
operations (e.g. floating IP associations, etc) with RPC callbacks once the
OVO work allows it.
* Stand up grenade partial job to make sure agents using different OVO versions
maintain N-1 compatibility
* Update devref for callbacks

Possible Future Work
* Switch to cast/cast pattern so agent isn't blocked waiting on server
* Setup a periodic system based on these revision numbers to have the agents
figure out if they have lost updates from the server. (e.g. periodic
broadcasts of revision numbers and UUIDs, sums of collections of revisions,
* Add an 'RPC pain multiplier' option that just causes all calls to the
neutron server to be duplicated X number of times. That way we can set
it to something like 200 for the gate which will force us to make every
call reasonably performant.
* Allow the HTTP API to perform compare and swap updates by placing an if-match
header with the revision number, which would cause the update to fail if
the version changed.


* The grenade partial job will be important to ensure we maintain our N-1
backward compatibility with agents from the previous release.
* API tests will be added to ensure the basic operation of the revision numbers
* Functional and unit tests to test the agent reactions to payloads

Documentation Impact

User Documentation


Developer Documentation

Devref guidelines on the pattern for getting information to agents and what
the acceptability criteria are for calls to the server.

RPC callbacks devref will need to be updated with notification strategy.