This work is licensed under a Creative Commons Attribution 3.0 Unported
License.

http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Push Notifications for Agents
=============================

RFE:
https://bugs.launchpad.net/neutron/+bug/1516195

Launchpad blueprint:
https://blueprints.launchpad.net/neutron/+spec/push-notifications

The current method we use to get information from the server to the agents
is driven by notifications and error-triggered calls from the agent to the
server. During normal operation, the server sends out a notification that a
specific object has changed (e.g. a port) and the agent responds by querying
the server for information about that port. If the agent encounters a
failure while processing changes, it starts over and re-queries the server
in the process.

The load on the server from this agent-driven approach can be very
unpredictable depending on the changes to object states on the neutron
server. For example, a single network update will result in a query from
every L2 agent with a port on that network.

This blueprint aims to change the pattern we use to get information to the
agents so that it is primarily based on pushing the object state out in the
change notifications. For anything that cannot leverage this method of
retrieval (e.g. initial agent startup still needs to poll), the AMQP timeout
handling will be fixed to use an exponential back-off that prevents the
agents from stampeding the server.


Problem Description
===================

An outage of a few agents and their recovery can lead to all of the agents
drowning the neutron servers with requests. This can cause the neutron
servers to fail to respond in time, which results in more retry requests
building up, leaving the entire system useless until operator intervention.

This is caused by three problems:

* We don't make optimal use of server notifications. There are times when
  the server sends a notification to an agent to inform it that something
  has changed, and the agent then has to call back to the server to get the
  relevant details. This means a single L3 rescheduling event for a set of
  routers due to a failed L3 agent can result in N more calls to the server,
  where N is the number of routers. Compounding this issue, a single agent
  may make multiple calls to the server for a single operation (e.g. the L2
  agent will make one call for port info and then another for security group
  info).

* The agents give up on a request after a short period of time and retry
  the request or issue an even more expensive request (e.g. if synchronizing
  info for one item fails, a major issue is assumed, so a request to sync
  all items is issued). By the time the server finishes fulfilling a
  request, the client is no longer waiting for the response, so the response
  goes in the trash. As this compounds, it leaves the server processing a
  massive queue of requests that won't even have listeners for the
  responses.

* Related to the second item, the agents are aggressive in their retry
  mechanisms. If a request times out, it is immediately retried with the
  same timeout value; that is, they have no back-off mechanism. (This has
  now been addressed by https://review.openstack.org/#/c/280595/ which adds
  back-off, sleep, and jitter.)
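
The back-off behavior described in the last bullet can be sketched roughly
as follows. This is an illustrative "full jitter" strategy; the function
name, parameters, and default values here are invented for the sketch, not
taken from the review linked above.

```python
import random

def backoff_intervals(base=1.0, cap=60.0, attempts=5):
    """Yield retry sleep intervals that grow exponentially, with jitter.

    base, cap, and the jitter strategy are illustrative choices only.
    """
    delay = base
    for _ in range(attempts):
        # Full jitter: sleep a random amount up to the current ceiling so
        # recovering agents don't all retry the server in lock-step.
        yield random.uniform(0, delay)
        # Double the ceiling each attempt, up to the cap.
        delay = min(cap, delay * 2)
```

An agent would sleep for each yielded interval between retries instead of
re-issuing the request immediately with a fixed timeout.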


Proposed Change
===============

Eliminate the expensive cases where calls are made to the neutron server in
response to a notification generated by the server. In most of these cases,
where the agent is just asking for regular neutron objects
(e.g. ports, networks), we can leverage the RPC callbacks mechanism
introduced in Liberty [1] to have the server send the entire changed object
as part of the notification so the agent has the information it needs.

The main targets for this will be the security group info call,
the get_device_details call, and the sync_routers call. Others will be
included if the change is trivial once these three are done.
The DHCP agent already relies on push notifications, so it will just
be updated to use the revision number to detect the out-of-order events
it is currently susceptible to.

For the remaining calls that cannot easily be converted to the callbacks
mechanism (e.g. the security groups call, which blends several objects;
the initial synchronization mechanism; and agent-generated calls), a nicer
timeout mechanism will be implemented with an exponential back-off and
timeout increase so a heavily loaded server is not continuously hammered
to death.


Changes to RPC callback mechanism
---------------------------------

The current issue with the RPC callback mechanism and sending objects as
notifications is a lack of server operation ordering guarantees and
AMQP message ordering guarantees.

To illustrate the first issue, examine the following order of events that
can happen when two servers update the same port:

* Server 1 commits update to DB
* Server 2 commits update to DB
* Server 2 sends notification
* Server 1 sends notification

If the agent receives the notifications in the order in which they are
delivered to AMQP, it will think the state delivered by Server 1 is the
current state when it is actually the state committed by Server 2.

We have the same issue when oslo messaging doesn't guarantee message order
(e.g. ZeroMQ). Even if Server 1 sends immediately after its commit and
before Server 2 commits and sends, one or more of the agents could end up
seeing Server 2's message before Server 1's.

To handle this, we will add a revision number, implemented as a monotonic
counter, to each object. This counter will be incremented on any update
so any agent can immediately identify stale messages.

To address deletes arriving before updates, agents will be expected to keep
a set of the UUIDs that have been deleted. Upon receiving an update, the
agent will check this set for the object's UUID and ignore the update,
since deletes are permanent and UUIDs cannot be re-used. If we do make IDs
recyclable in the future, this can be replaced with a strategy to confirm
ID existence with the server, or we can add another internal UUID that
cannot be specified.

Note that this doesn't guarantee message ordering for the agent, because
that is a property of the message backend, but it does give the agent the
necessary info to re-order messages when it receives them so it can
determine which one reflects the more recent state of the DB.
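
The revision-number and deleted-UUID bookkeeping described above can be
sketched as follows. The class and method names are hypothetical, not
actual neutron agent code.

```python
class ResourceCache:
    """Tracks the highest revision seen per object plus a tombstone set of
    deleted UUIDs, so stale or post-delete notifications can be discarded."""

    def __init__(self):
        self._revisions = {}   # uuid -> highest revision_number seen
        self._deleted = set()  # uuids that have been deleted (permanent)

    def record_delete(self, uuid):
        self._revisions.pop(uuid, None)
        self._deleted.add(uuid)

    def should_apply(self, uuid, revision_number):
        # Deletes are permanent and UUIDs are never re-used, so any update
        # for a deleted UUID must be a stale, out-of-order message.
        if uuid in self._deleted:
            return False
        # Discard anything at or below the revision we already have.
        if revision_number <= self._revisions.get(uuid, -1):
            return False
        self._revisions[uuid] = revision_number
        return True
```

On each incoming notification the agent would call ``should_apply`` and
skip processing when it returns False.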


Data Model Impact
-----------------

A 'revision_number' column will be added to the standard attr table. This
column will just be a simple big integer used as a monotonic counter that
is updated whenever the object is updated on the neutron server. This
revision number can then be used by the agents to automatically discard
any object states that are older than the state they already have.

This revision_number will use the version counter feature built into
SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/versioning.html
Each time an object is updated, the server will perform a compare-and-swap
operation based on the revision number. This ensures that each update must
start with the current revision number or it will fail with a
StaleDataError. The API layer can catch this error with the current DB
retry mechanism and start over with the latest revision number.
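
A minimal sketch of that SQLAlchemy versioning feature on a stand-in model
(the table and column layout here is illustrative, not neutron's actual
standard attribute schema; assumes SQLAlchemy 1.4+):

```python
from sqlalchemy import Column, Integer, String, create_engine, text
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import StaleDataError

Base = declarative_base()

class StandardAttr(Base):
    # Stand-in for neutron's standard attribute table; schema is illustrative.
    __tablename__ = 'standardattributes'
    id = Column(Integer, primary_key=True)
    description = Column(String(255))
    revision_number = Column(Integer, nullable=False)
    # SQLAlchemy now emits "UPDATE ... WHERE revision_number = <value read>"
    # and raises StaleDataError when no row matches (compare-and-swap).
    __mapper_args__ = {'version_id_col': revision_number}

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(StandardAttr(id=1, description='a port'))
    session.commit()

    obj = session.get(StandardAttr, 1)  # reads revision_number == 1
    # Simulate a concurrent writer bumping the revision behind our back.
    session.execute(text(
        "UPDATE standardattributes SET revision_number = 5 WHERE id = 1"))
    obj.description = 'changed'
    try:
        session.flush()  # WHERE revision_number = 1 matches 0 rows
        stale = False
    except StaleDataError:
        stale = True     # the API layer would retry the update here
```

The except branch is where the DB retry mechanism mentioned above would
re-read the object and start the update over.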

While SQLAlchemy will automatically bump the revision for us when the
record for an object is updated (e.g. a standard attr description field),
it will not update it when a related object changes (e.g. adding an IP
address to a port or changing its status). So we will have to manually
trigger the revision bump (either via a PRECOMMIT callback or inline code)
for any operations that we want to bump the revision number.

What this guarantees:

- An object in a notification is newer (from a DB state perspective) than
  an object with a lower revision number. So any objects with lower
  revision numbers can safely be ignored since they represent stale DB
  state.

What this doesn't guarantee:

- Message ordering 'on the wire'. An AMQP listener may end up receiving an
  older state than a message it has already received. It's up to the
  listener to look at the revision number to determine if the message is
  stale.
- That each intermediate state is transmitted. If a notification mechanism
  reads the DB to get the full object to send, the DB state may have
  progressed, so it will notify with the latest state rather than the state
  that triggered the original notification. This is acceptable for all of
  our use cases since we only care about the current state of the object to
  wire up the dataplane. It is also effectively what we have now, since the
  DB state could change between when the agent gets a notification and when
  it actually asks the server for details.
- Reliability of the notifications themselves. This doesn't address the
  issue we currently have where a dropped notification is not detected.


Notifications Impact
--------------------

Existing notifications will become significantly more data-rich. The hope
is to eliminate many of the expensive RPC calls that each agent makes and
have each agent derive all state from notifications, with one sync method
for recovery/initialization that we can focus on optimizing.

This will result in more data being sent up front by the server to the
messaging layer, but it will eliminate the data that would be sent in
response to a call request from the agent in the current pattern. For a
single agent, the only gain is the elimination of the notification and
call messages; but for multiple agents interested in the same resource,
it eliminates extra DB calls and extra messages from the server to fulfill
those calls.

This pattern will result in fewer messages sent to oslo messaging because
it eliminates the agent calls that would all return the same payload: we
preemptively broadcast once instead of casting multiple times to each
requesting agent.


Performance Impact
------------------

A higher ratio of neutron agents per server, afforded by a large reduction
in sporadic queries by the agents.

This comes at the cost of effectively serializing operations on an
individual object due to the compare-and-swap operation on the server. For
example, if two server threads try to update a single object concurrently
and both read the current state of the object at the same time, one will
fail on commit with a StaleDataError, which will be retried by the API
layer. Previously both of these would succeed because the UPDATE statement
had no compare-and-swap WHERE criteria. However, this is a very reasonable
performance cost to pay considering that concurrent updates to the same
API object are not common.


Other Deployer Impact
---------------------

N/A - the upgrade path will maintain normal N-1 backward compatibility on
the server, so all of the current RPC endpoints will be left untouched for
one cycle.


Developer Impact
----------------

Development guidelines need to change to discourage the implementation of
new direct server calls.

The notifications will have to send out oslo versioned objects, since
notifications don't have RPC versions. So at a minimum we need to switch
to oslo versioned objects in the notification code even if we can't get
them fully implemented everywhere else. To do this we can leverage the RPC
callbacks mechanism.
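
The point of versioned objects is that the payload carries its own schema
version independent of any RPC endpoint version, so it can be backlevelled
for older agents. A stdlib-only sketch of that idea follows; this is *not*
the oslo.versionedobjects API, and every name and field here is invented
for illustration.

```python
import json

def _ver(v):
    # Parse '1.1' -> (1, 1) so versions compare numerically.
    return tuple(int(p) for p in v.split('.'))

class VersionedPayload:
    """Toy stand-in for a versioned notification object."""
    VERSION = '1.1'

    def __init__(self, uuid, revision_number, status=None):
        self.uuid = uuid
        self.revision_number = revision_number
        self.status = status  # hypothetical field added in version 1.1

    def to_primitive(self, target_version=None):
        # Serialize at the requested version, dropping newer fields so an
        # N-1 agent only sees the schema it understands.
        version = target_version or self.VERSION
        data = {'uuid': self.uuid, 'revision_number': self.revision_number}
        if _ver(version) >= (1, 1):
            data['status'] = self.status
        return json.dumps({'version': version, 'data': data})
```

A server would pin ``target_version`` to the oldest version still deployed
during a rolling upgrade.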


Alternatives
------------

Maintain the current information retrieval pattern and just adjust the
timeout mechanism for everything to include back-offs, or use cast/cast
instead of calls. This would allow a system to automatically recover from
self-induced death by stampede, but it would not make the performance any
more predictable.


Implementation
==============

Assignee(s)
-----------

Primary assignees:
  kevinbenton
  Ihar Hrachyshka


Work Items
----------

* Exponential back-off for timeouts on agents.
* Implement a 'revision' extension to add the revision_number column to the
  data model and expose it as a standard attribute.
* Write tests to ensure revisions are incremented as expected.
* Write (at least one) test that verifies a StaleDataError is triggered
  in the event of concurrent updates.
* Update the DHCP agent to make use of the new 'revision' field to discard
  stale updates. This will be used as the proof of concept for this
  approach since the DHCP agent is currently exposed to operating on stale
  data with out-of-order messages.
* Replace the use of sync_routers calls on the L3 agents for the most
  frequent operations (e.g. floating IP associations) with RPC callbacks
  once the OVO work allows it.
* Stand up a grenade partial job to make sure agents using different OVO
  versions maintain N-1 compatibility.
* Update the devref for callbacks.


Possible Future Work
--------------------

* Switch to a cast/cast pattern so the agent isn't blocked waiting on the
  server.
* Set up a periodic system based on these revision numbers to have the
  agents figure out if they have lost updates from the server (e.g.
  periodic broadcasts of revision numbers and UUIDs, sums of collections
  of revisions, etc.).
* Add an 'RPC pain multiplier' option that causes all calls to the neutron
  server to be duplicated X number of times. That way we can set it to
  something like 200 for the gate, which will force us to make every call
  reasonably performant.
* Allow the HTTP API to perform compare-and-swap updates by placing an
  If-Match header with the revision number, which would cause the update
  to fail if the version changed.
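
Server-side, the If-Match idea in the last bullet could look roughly like
this. The handler and its in-memory ``store`` are hypothetical stand-ins
for the real API and DB layers, and note that If-Match normally carries an
ETag; using the raw revision number is this spec's proposed variant.

```python
def update_with_if_match(store, uuid, new_fields, if_match=None):
    """Apply an update only if the caller's expected revision still holds.

    store maps uuid -> dict with a 'revision_number' key (a toy stand-in
    for the DB layer). Returns (http_status, object).
    """
    obj = store[uuid]
    if if_match is not None and int(if_match) != obj['revision_number']:
        return 412, obj          # Precondition Failed: revision moved on
    obj.update(new_fields)
    obj['revision_number'] += 1  # compare-and-swap succeeded; bump revision
    return 200, obj
```

A client that gets 412 would re-GET the object, re-apply its change to the
fresh state, and retry with the new revision number.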


Testing
=======

* The grenade partial job will be important to ensure we maintain N-1
  backward compatibility with agents from the previous release.
* API tests will be added to ensure the basic operation of the revision
  numbers.
* Functional and unit tests to test the agent reactions to payloads.


Documentation Impact
====================


User Documentation
------------------

N/A


Developer Documentation
-----------------------

Devref guidelines on the pattern for getting information to agents and
what the acceptability criteria are for calls to the server.

The RPC callbacks devref will need to be updated with the notification
strategy.


References
==========

1. http://git.openstack.org/cgit/openstack/neutron/tree/doc/source/devref/rpc_callbacks.rst
2. https://www.rabbitmq.com/semantics.html#ordering