
Add spec for push notification refactor

Adds a spec to change the method we use to get
information from the server to the agents. Rather
than the server notifying the agent to call the server,
we can just put the relevant data in the notification
itself to improve scalability and reliability.

The bulk of this spec is dealing with the message ordering
guarantee we will need to accomplish this. It also has
some work items to help improve our current pattern.

Related-Bug: #1516195
Change-Id: I3af200ad84483e6e1fe619d516ff20bc87041f7c
Author: Kevin Benton
File: specs/newton/push-notifications.rst

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Push Notifications for Agents
=============================

RFE:
https://bugs.launchpad.net/neutron/+bug/1516195

Launchpad blueprint:
https://blueprints.launchpad.net/neutron/+spec/push-notifications

The current method we use to get information from the server to the agents
is driven by notification- and error-triggered calls to the server by the
agents.
So during normal operation, the server will send out a notification that a
specific object has changed (e.g. a port) and then the agent will respond
to that by querying the server for information about that port. If the
agent encounters a failure while processing changes, it will start over
and re-query the server in the process.

The load on the server from this agent-driven approach can be very
unpredictable depending on the changes to object states on the neutron
server. For example, a single network update will result in a query from
every L2 agent with a port on that network.

This blueprint aims to change the pattern we use to get information to the
agents so that it is primarily based on pushing the object state out in the
change notifications. For anything not converted to this method of retrieval
(e.g. initial agent startup still needs to poll), the AMQP timeout handling
will be fixed to use an exponential back-off to prevent the agents
from stampeding the server.


Problem Description
===================

An outage of a few agents and their recovery can lead to all of the agents
drowning the neutron servers with requests. This can cause the neutron servers
to fail to respond in time, which results in more retry requests building up,
leaving the entire system useless until operator intervention.

This is caused by 3 problems:

* We don't make optimal use of server notifications. There are times when
  the server will send a notification to an agent to inform it that something
  has changed. Then the agent has to make a call back to the server to get the
  relevant details. This means a single L3 rescheduling event of a set of
  routers due to a failed L3 agent can result in N more calls to the server,
  where N is the number of routers. Compounding this issue, a single agent
  may make multiple calls to the server for a single operation (e.g. the L2
  agent will make one call for port info, and then another for security group
  info).

* The agents will give up after a short period of time on a request and retry
  the request or issue an even more expensive request (e.g. if synchronizing
  info for one item fails, a major issue is assumed, so a request to sync all
  items is issued). So by the time the server finishes fulfilling a request,
  the client is no longer waiting for the response, so it goes in the
  trash. As this compounds, it leaves the server processing a massive queue
  of requests that won't even have listeners for the responses.

* Related to the second item is the fact that the agents are aggressive in
  their retry mechanisms. If a request times out, that request is immediately
  retried with the same timeout value; that is, they have no back-off
  mechanism. (This has now been addressed by
  https://review.openstack.org/#/c/280595/, which adds back-off,
  sleep, and jitter.)
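The back-off behaviour referenced in the last bullet can be sketched roughly as follows; the base, cap, and jitter values here are illustrative, not the ones used by the actual fix:

```python
import random


def next_timeout(base, attempt, cap=600, jitter=0.2):
    """Exponential back-off: double the timeout on each failed attempt,
    cap it, then add random jitter so a fleet of recovering agents does
    not retry in lock-step. All parameter values are illustrative."""
    timeout = min(base * (2 ** attempt), cap)
    # Spread retries over +/- (jitter * timeout) seconds.
    return timeout + random.uniform(-jitter, jitter) * timeout


# Without jitter, a 10s base timeout grows 10, 20, 40, ... up to the cap.
schedule = [min(10 * 2 ** a, 600) for a in range(8)]
```

The jitter term matters as much as the doubling: without it, agents that failed at the same moment would all retry at the same moment too.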


Proposed Change
===============

Eliminate expensive cases where calls are made to the neutron server in
response to a notification generated by the server. In most of these cases
where the agent is just asking for regular neutron objects
(e.g. ports, networks), we can leverage the RPC callbacks mechanism
introduced in Liberty[1] to have the server send the entire changed object
as part of the notification so the agent has the information it needs.

The main targets for this will be the security group info call,
the get_device_details call, and the sync_routers call. Others will be
included if the change is trivial once these three are done.
The DHCP agent already relies on push notifications, so it will just
be updated to use the revision number to detect the out-of-order events
it is currently susceptible to.

For the remaining calls that cannot easily be converted into the callbacks
mechanism (e.g. the security groups call which blends several objects,
the initial synchronization mechanism, and agent-generated calls), a nicer
timeout mechanism will be implemented with an exponential back-off and timeout
increase so a heavily loaded server is not continuously hammered to death.


Changes to RPC callback mechanism
---------------------------------

The current issue with the RPC callback mechanism and sending objects as
notifications is a lack of server operation ordering guarantees and
AMQP message ordering guarantees.

To illustrate the first issue, examine the following order of events that
happen when two servers update the same port:

* Server 1 commits update to DB
* Server 2 commits update to DB
* Server 2 sends notification
* Server 1 sends notification

If the agent receives the notifications in the order in which they are
delivered to AMQP, it will think the state delivered by Server 1 is the
current state when it is actually the state committed by Server 2.

We also have the same issue when oslo messaging doesn't guarantee
message order (e.g. ZeroMQ). Even if Server 1 sends immediately after
its commit and before Server 2 commits and sends, one or more of the
agents could end up seeing Server 2's message before Server 1's.

To handle this, we will add a revision number, implemented as a monotonic
counter, to each object. This counter will be incremented on any update
so any agent can immediately identify stale messages.

To address deletes arriving before updates, agents will be expected
to keep a set of the UUIDs that have been deleted. Upon receiving an update,
the agent will check this set for the object's UUID and, if it is present,
ignore the update, since deletes are permanent and UUIDs cannot be re-used.
If we do make IDs recyclable in the future, this can be replaced with a
strategy to confirm ID existence with the server, or we can add another
internal UUID that cannot be specified.

Note that this doesn't guarantee message ordering for the agent, because
that is a property of the messaging backend, but it does give the agent the
necessary info to re-order messages when it receives them so it can
determine which one reflects the more recent state of the DB.
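Putting the revision check and the deleted-UUID set together, the agent-side filtering described above could look roughly like this (class and method names are illustrative, not Neutron's actual code):

```python
class ResourceCache:
    """Minimal sketch of the agent-side staleness filtering described
    in this spec; not Neutron's actual implementation."""

    def __init__(self):
        self.objects = {}     # uuid -> (revision_number, payload)
        self.deleted = set()  # UUIDs of permanently deleted objects

    def record_delete(self, uuid):
        self.objects.pop(uuid, None)
        self.deleted.add(uuid)

    def record_update(self, uuid, revision, payload):
        """Apply an update notification; return False if it is stale."""
        if uuid in self.deleted:
            # A delete already arrived. UUIDs are never re-used, so any
            # update for this UUID must be older state: ignore it.
            return False
        current = self.objects.get(uuid)
        if current is not None and current[0] >= revision:
            # A lower (or equal) revision number means stale DB state.
            return False
        self.objects[uuid] = (revision, payload)
        return True
```

With this in place the agent can consume notifications in whatever order the backend delivers them and still converge on the newest committed state.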


Data Model Impact
-----------------

A 'revision_number' column will be added to the standard attr table. This
column will just be a simple big integer used as a monotonic counter that
will be updated whenever the object is updated on the neutron server.
This revision number can then be used by the agents to automatically
discard any object states that are older than the state they already have.

This revision_number will use the version counter feature which is built-in
to SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/versioning.html
Each time an object is updated, the server will perform a compare-and-swap
operation based on the revision number. This ensures that each update must
start with the current revision number or it will fail with a StaleDataError.
The API layer can catch this error with the current DB retry mechanism and
start over with the latest revision number.
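A minimal sketch of this compare-and-swap behaviour using SQLAlchemy's version counter feature. The model here is a toy stand-in (the real column lives on the shared standard attr table), and the in-memory SQLite setup exists only to let two sessions play the role of two API workers:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import StaleDataError
from sqlalchemy.pool import StaticPool

Base = declarative_base()


class Port(Base):
    """Toy model for illustration only."""
    __tablename__ = 'ports'
    id = Column(Integer, primary_key=True)
    status = Column(String(16))
    revision_number = Column(Integer, nullable=False)
    # Bump revision_number on every UPDATE and include it in the WHERE
    # clause, giving compare-and-swap semantics.
    __mapper_args__ = {'version_id_col': revision_number}


# A single shared in-memory DB so two sessions see the same rows.
engine = create_engine('sqlite://', poolclass=StaticPool,
                       connect_args={'check_same_thread': False})
Base.metadata.create_all(engine)

with Session(engine) as s:
    s.add(Port(id=1, status='DOWN'))
    s.commit()  # revision_number starts at 1

# Two "API workers" read the same row, then both try to update it.
s1, s2 = Session(engine), Session(engine)
p1, p2 = s1.get(Port, 1), s2.get(Port, 1)
p2.status = 'ACTIVE'
s2.commit()      # wins; revision_number is bumped to 2
p1.status = 'ERROR'
try:
    s1.commit()  # UPDATE ... WHERE revision_number = 1 matches nothing
except StaleDataError:
    s1.rollback()  # the API layer would re-read and retry here
```

The loser of the race never silently overwrites the winner; it gets a StaleDataError and restarts from the freshly committed state, exactly the retry behaviour described above.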

While SQLAlchemy will automatically bump the revision for us when the record
for an object is updated (e.g. a standard attr description field), it will
not update it if it's a related object changing (e.g. adding an IP address
to the port or changing its status). So we will have to manually trigger
the revision bump (either via a PRECOMMIT callback or inline code) for
any such related-object operations.

What this guarantees:

- An object in a notification is newer (from a DB state perspective) than
  an object with a lower revision number. So any objects with lower revision
  numbers can safely be ignored since they represent stale DB state.

What this doesn't guarantee:

- Message ordering 'on the wire'. An AMQP listener may end up receiving an
  older state than a message it has already received. It's up to the listener
  to look at the revision number to determine if the message is stale.
- That each intermediate state is transmitted. If a notification mechanism
  reads the DB to get the full object to send, the DB state may have
  progressed, so it will notify with the latest state rather than the state
  that triggered the original notification. This is acceptable for all of our
  use cases since we only care about the current state of the object to wire
  up the dataplane. It is also effectively what we have now, since the DB
  state could change between when the agent gets a notification and when it
  actually asks the server for details.
- Reliability of the notifications themselves. This doesn't address the issue
  we currently have where a dropped notification is not detected.


Notifications Impact
--------------------

Existing notifications will become significantly more data-rich. The hope here
is to eliminate many of the expensive RPC calls that each agent makes and have
each agent derive all state from notifications, with one sync method for
recovery/initialization that we can focus on optimizing.

This will result in more data being sent up front by the server to the
messaging layer, but it will eliminate the data that would be sent in
response to a call request from the agent in the current pattern. For a
single agent, the only gain is the elimination of the notification and
call messages; but for multiple agents interested in the same resource,
it eliminates extra DB calls and extra messages from the server to fulfill
those calls.

This pattern will result in fewer messages sent to oslo messaging: the calls
from the agents are eliminated, and the payload they would have requested is
preemptively broadcast once instead of being cast multiple times, once to
each requesting agent.
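As a rough illustration of the accounting (my own simplified model, not measured figures): for a change that N agents care about, the current pattern costs one fanout notification plus a call request and a reply per agent, while the push pattern costs a single fanout carrying the object state:

```python
def messages_current(n_agents):
    """Current pattern: one fanout 'changed' notification, then one
    call request and one reply per interested agent."""
    return 1 + 2 * n_agents


def messages_push(n_agents):
    """Push pattern: a single fanout notification that already carries
    the object state, regardless of how many agents are interested."""
    return 1
```

Under this model, ten interested agents cost 21 messages today and one message with the push pattern; the gap grows linearly with the number of agents.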


Performance Impact
------------------

This change allows a higher ratio of neutron agents per server, afforded by
a large reduction in sporadic queries from the agents.

This comes at the cost of effectively serializing operations on an individual
object due to the compare-and-swap operation on the server. For example,
if two server threads try to update a single object concurrently and both
read the current state of the object at the same time, one will fail on
commit with a StaleDataError and will be retried by the API layer.
Previously both of these would succeed because the UPDATE statement
would have no compare-and-swap WHERE criteria. However, this is a very
reasonable performance cost to pay considering that concurrent updates to
the same API object are not common.


Other Deployer Impact
---------------------

N/A - upgrade path will maintain normal N-1 backward compatibility on the
server so all of the current RPC endpoints will be left untouched for one
cycle.


Developer Impact
----------------

Need to change development guidelines to avoid the implementation of new
direct server calls.

The notifications will have to send out oslo versioned objects since
notifications don't have RPC versions. So at a minimum we need to
switch to oslo versioned objects in the notification code if we
can't get them fully implemented everywhere else. To do this we
can leverage the RPC callbacks mechanism.


Alternatives
------------

Maintain the current information retrieval pattern and just adjust the timeout
mechanism for everything to include back-offs or use cast/cast instead of
calls. This will allow a system to automatically recover from self-induced
death by stampede, but it will not make the performance any more predictable.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  kevinbenton

Other contributors:
  Ihar Hrachyshka


Work Items
----------

* Exponential back-off for timeouts on agents.
* Implement a 'revision' extension to add the revision_number column to the
  data model and expose it as a standard attribute.
* Write tests to ensure revisions are incremented as expected.
* Write (at least one) test that verifies a StaleDataError is triggered
  in the event of concurrent updates.
* Update the DHCP agent to make use of this new 'revision' field to discard
  stale updates. This will be used as the proof of concept for this approach,
  since the DHCP agent is currently exposed to operating on stale data with
  out-of-order messages.
* Replace the use of sync_routers calls on the L3 agents for the most frequent
  operations (e.g. floating IP associations) with RPC callbacks once the
  OVO work allows it.
* Stand up a grenade partial job to make sure agents using different OVO
  versions maintain N-1 compatibility.
* Update the devref for callbacks.


Possible Future Work
--------------------

* Switch to a cast/cast pattern so the agent isn't blocked waiting on the
  server.
* Set up a periodic system based on these revision numbers to have the agents
  figure out if they have lost updates from the server (e.g. periodic
  broadcasts of revision numbers and UUIDs, sums of collections of revisions,
  etc.).
* Add an 'RPC pain multiplier' option that just causes all calls to the
  neutron server to be duplicated X number of times. That way we can set
  it to something like 200 for the gate, which will force us to make every
  call reasonably performant.
* Allow the HTTP API to perform compare-and-swap updates by placing an
  If-Match header with the revision number, which would cause the update to
  fail if the version changed.
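The If-Match idea in the last bullet amounts to doing the same compare-and-swap at the HTTP layer. A hypothetical server-side sketch (function name and in-memory store are illustrative; 412 is the standard HTTP status for a failed If-Match precondition):

```python
def update_with_if_match(store, uuid, if_match_revision, new_fields):
    """Hypothetical handling of a conditional update: apply the change
    only if the client-supplied revision matches the current one,
    mirroring the DB-level compare-and-swap described in this spec."""
    obj = store[uuid]
    if obj['revision_number'] != if_match_revision:
        # Precondition failed; the client must re-read and retry.
        return 412, obj
    obj.update(new_fields)
    obj['revision_number'] += 1
    return 200, obj
```

A client that re-reads on a 412 and resubmits with the fresh revision gets lost-update protection end to end, without the server having to hold any locks between requests.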


Testing
=======

* The grenade partial job will be important to ensure we maintain our N-1
  backward compatibility with agents from the previous release.
* API tests will be added to ensure the basic operation of the revision
  numbers.
* Functional and unit tests to test the agent reactions to payloads.


Documentation Impact
====================


User Documentation
------------------

N/A


Developer Documentation
-----------------------

Devref guidelines on the pattern for getting information to agents and what
the acceptability criteria are for calls to the server.

RPC callbacks devref will need to be updated with notification strategy.

References
==========

1. http://git.openstack.org/cgit/openstack/neutron/tree/doc/source/devref/rpc_callbacks.rst
2. https://www.rabbitmq.com/semantics.html#ordering
