Add spec for push notification refactor

Adds a spec to change the method we use to get
information from the server to the agents. Rather
than the server notifying the agent to call the server,
we can just put the relevant data in the notification
itself to improve scalability and reliability.

The bulk of this spec is dealing with the message ordering
guarantee we will need to accomplish this. It also has
some work items to help improve our current pattern.

Related-Bug: #1516195
Change-Id: I3af200ad84483e6e1fe619d516ff20bc87041f7c

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Push Notifications for Agents
=============================
RFE:
https://bugs.launchpad.net/neutron/+bug/1516195
Launchpad blueprint:
https://blueprints.launchpad.net/neutron/+spec/push-notifications
The current method we use to get information from the server to the agent
is driven by notification and error-triggered calls to the server by the agent.
So during normal operation, the server will send out a notification that a
specific object has changed (e.g. a port) and then the agent will respond
to that by querying the server for information about that port. If the
agent encounters a failure while processing changes, it will start over
and re-query the server in the process.
The load on the server from this agent-driven approach can be very
unpredictable depending on the changes to object states on the neutron
server. For example, a single network update will result in a query from
every L2 agent with a port on that network.
This blueprint aims to change the pattern we use to get information to the
agents to primarily be based on pushing the object state out in the change
notifications. For anything not changed to leverage this method of retrieval
(e.g. initial agent startup still needs to poll), the AMQP timeout handling
will be fixed to ensure it has an exponential back-off to prevent the agents
from stampeding the server.
Problem Description
===================
An outage of a few agents and their recovery can lead to all of the agents
drowning the neutron servers with requests. This can cause the neutron servers
to fail to respond in time, which results in more retry requests building up,
leaving the entire system useless until operator intervention.
This is caused by three problems:
* We don't make optimal use of server notifications. There are times when
the server will send a notification to an agent to inform it that something
has changed. Then the agent has to make a call back to the server to get the
relevant details. This means a single L3 rescheduling event of a set of
routers due to a failed L3 agent can result in N more calls to the server
where N is the number of routers. Compounding this issue, a single agent
may make multiple calls to the server for a single operation (e.g. the L2
agent will make one call for port info, and then another for security group
info).
* The agents will give up on a request after a short period of time and retry
the request or issue an even more expensive request (e.g. if synchronizing
info for one item fails, a major issue is assumed, so a request to sync all
items is issued). By the time the server finishes fulfilling a request, the
client is often no longer waiting for the response, so the reply goes in the
trash. As this compounds, the server is left processing a massive queue of
requests that no longer have listeners for their responses.
* Related to the second item is the fact that the agents are aggressive in
their retry mechanisms. If a request times out, that request is immediately
retried with the same timeout value; that is, they have no back-off
mechanism. (This has now been addressed by
https://review.openstack.org/#/c/280595/ which adds backoff,
sleep, and jitter.)
Proposed Change
===============
Eliminate expensive cases where calls are made to the neutron server in
response to a notification generated by the server. In most of these cases
where the agent is just asking for regular neutron objects
(e.g. ports, networks), we can leverage the RPC callbacks mechanism
introduced in Liberty[1] to have the server send the entire changed object
as part of the notification so the agent has the information it needs.
The main targets for this will be the security group info call,
the get_device_details call, and the sync_routers call. Others will be
included if the change is trivial once these three are done.
The DHCP agent already relies on push notifications, so it will just
be updated to use the revision number to detect the out-of-order events
it is currently susceptible to.
For the remaining calls that cannot easily be converted into the callbacks
mechanism (e.g. the security groups call which blends several objects,
the initial synchronization mechanism, and agent-generated calls), a nicer
timeout mechanism will be implemented with an exponential back-off and timeout
increase so a heavily loaded server is not continuously hammered to death.
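
For illustration only, a minimal sketch of what such a back-off wrapper could
look like on the agent side (the helper name and parameters here are
assumptions, not the actual implementation in the review linked above)::

  import random
  import time

  import oslo_messaging


  def call_with_backoff(client, context, method, base_timeout=10,
                        max_timeout=600, **kwargs):
      """Retry an RPC call with exponentially increasing timeouts."""
      timeout = base_timeout
      while True:
          try:
              return client.prepare(timeout=timeout).call(
                  context, method, **kwargs)
          except oslo_messaging.MessagingTimeout:
              if timeout >= max_timeout:
                  raise
              # exponential back-off with jitter so a fleet of agents
              # doesn't retry in lockstep and stampede the server
              timeout = min(timeout * 2, max_timeout)
              time.sleep(random.uniform(0, timeout))
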
Changes to RPC callback mechanism
---------------------------------
The current issue with the RPC callback mechanism and sending objects as
notifications is the lack of both server-side operation ordering guarantees
and AMQP message ordering guarantees.
To illustrate the first issue, examine the following order of events that
happen when two servers update the same port:
* Server 1 commits update to DB
* Server 2 commits update to DB
* Server 2 sends notification
* Server 1 sends notification
If the agent receives the notifications in the order in which they are
delivered to AMQP, it will think the state delivered by Server 1 is the
current state when it is actually the state committed by Server 2.
We also have the same issue when oslo messaging doesn't guarantee
message order (e.g. ZeroMQ). Even if Server 1 sends immediately after
its commit and before Server 2 commits and sends, one or more of the
agents could end up seeing Server 2's message before Server 1's.
To handle this, we will add a revision number, implemented as a monotonic
counter, to each object. This counter will be incremented on any update
so any agent can immediately identify stale messages.
To address deletes arriving before updates, agents will be expected
to keep a set of the UUIDs that have been deleted. Upon receiving an update,
the agent will check this set for the object's UUID and, if it is present,
ignore the update, since deletes are permanent and UUIDs cannot be re-used.
If we do make IDs recyclable in the future, this can be replaced with a
strategy to confirm ID existence with the server, or we can add another
internal UUID that cannot be specified.
Note that this doesn't guarantee message ordering for the agent, because
that is a property of the messaging backend, but it does give the agent the
information it needs to re-order messages as it receives them and determine
which one reflects the more recent state of the DB.
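
As a rough sketch of the agent-side bookkeeping described above (the class
and method names here are illustrative, not an existing Neutron API)::

  class ResourceCache(object):
      """Track the newest revision seen per object plus deleted UUIDs."""

      def __init__(self):
          self._resources = {}       # uuid -> latest object payload seen
          self._deleted_ids = set()  # uuids that can never be reused

      def record_delete(self, resource_id):
          self._resources.pop(resource_id, None)
          self._deleted_ids.add(resource_id)

      def record_update(self, resource):
          """Return True if the update is fresh, False if it is stale."""
          if resource['id'] in self._deleted_ids:
              # a delete already arrived; UUIDs are never reused, so drop it
              return False
          current = self._resources.get(resource['id'])
          if current is not None and (current['revision_number'] >=
                                      resource['revision_number']):
              # out-of-order delivery of an older DB state; ignore it
              return False
          self._resources[resource['id']] = resource
          return True
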
Data Model Impact
-----------------
A 'revision_number' column will be added to the standard attr table. This
column will be a simple big integer used as a monotonic counter that
will be updated whenever the object is updated on the neutron server.
This revision number can then be used by the agents to automatically
discard any object states that are older than the state they already have.
This revision_number will use the version counter feature built into
SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/versioning.html
Each time an object is updated, the server will perform a compare-and-swap
operation based on the revision number. This ensures that each update must
start with the current revision number or it will fail with a StaleDataError.
The API layer can catch this error with the current DB retry mechanism and
start over with the latest revision number.
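
For illustration, a minimal sketch of such a mapping (the model below is a
simplified stand-in for the real standard attr model, not its actual
definition)::

  import sqlalchemy as sa
  from sqlalchemy.ext.declarative import declarative_base

  Base = declarative_base()


  class StandardAttribute(Base):
      """Illustrative mapping of the standard attr table with versioning."""

      __tablename__ = 'standardattributes'

      id = sa.Column(sa.BigInteger, primary_key=True, autoincrement=True)
      revision_number = sa.Column(sa.BigInteger, nullable=False)

      # With version_id_col set, every UPDATE is emitted roughly as
      #   UPDATE standardattributes SET ..., revision_number = :new
      #   WHERE id = :id AND revision_number = :expected
      # and SQLAlchemy raises sqlalchemy.orm.exc.StaleDataError if no
      # row matched, giving us the compare-and-swap behaviour above.
      __mapper_args__ = {'version_id_col': revision_number}
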
While SQLAlchemy will automatically bump the revision for us when the record
for an object is updated (e.g. a standard attr description field), it will
not do so when a related object changes (e.g. adding an IP address to the
port or changing its status). So we will have to bump the revision manually
(either via a PRECOMMIT callback or inline code) for any operation that
should result in a new revision number, as in the sketch below.
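
A rough sketch of the PRECOMMIT-callback option; the event name, the callback
kwargs, and the attribute touched here are all assumptions made purely for
illustration::

  from neutron.callbacks import events, registry, resources
  from sqlalchemy.orm.attributes import flag_modified


  def _bump_port_revision(resource, event, trigger, **kwargs):
      # 'port_db' and 'standard_attr' are illustrative names for the DB
      # objects such a callback would be handed
      port_db = kwargs['port_db']
      # mark the standard attr row dirty so the flush emits an UPDATE and
      # the SQLAlchemy version counter increments revision_number even
      # though no column on that row actually changed
      flag_modified(port_db.standard_attr, 'description')


  registry.subscribe(_bump_port_revision, resources.PORT,
                     events.PRECOMMIT_UPDATE)
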
What this guarantees:
- An object in a notification is newer (from a DB state perspective) than
the same object with a lower revision number, so any object with a lower
revision number than one already seen can safely be ignored as stale DB
state.
What this doesn't guarantee:
- Message ordering 'on the wire'. An AMQP listener may end up receiving an
older state than a message it has already received. It's up to the listener
to look at the revision number to determine if the message is stale.
- That each intermediate state is transmitted. If a notification mechanism
reads the DB to get the full object to send, the DB state may have
progressed, so it will notify with a later state than the one that triggered
the original notification. This is acceptable for all of our use cases since
we only care about the current state of the object to wire up the dataplane.
It is also effectively what we have now, since the DB state can change
between when the agent gets a notification and when it actually asks the
server for details.
- Reliability of the notifications themselves. This doesn't address the issue
we currently have where a dropped notification is not detected.
Notifications Impact
--------------------
Existing notifications will become significantly more data-rich. The hope is
to eliminate many of the expensive RPC calls that each agent makes and have
each agent derive all state from notifications, with one sync method for
recovery/initialization that we can focus on optimizing.
This will result in more data being sent up front by the server to the
messaging layer, but it will eliminate the data that would be sent in
response to a call request from the agent in the current pattern. For a
single agent, the only gain is the elimination of the notification and
call messages; but for multiple agents interested in the same resource,
it eliminates extra DB calls and extra messages from the server to fulfill
those calls.
This pattern will also result in fewer messages through oslo messaging:
the agent calls that would each return the same payload are eliminated,
because that payload is now broadcast preemptively once instead of being
cast separately to each requesting agent.
Performance Impact
------------------
A higher ratio of neutron agents per server, afforded by a large reduction in
sporadic queries from the agents.
This comes at the cost of effectively serializing operations on an individual
object due to the compare-and-swap operation on the server. For example,
if two server threads try to update a single object concurrently and both
read the current state of the object at the same time, one will fail on
commit with a StaleDataError which will be retried by the API layer.
Previously both of these would succeed because the UPDATE statement
would have no compare-and-swap WHERE criteria. However, this is a very
reasonable performance cost to pay considering that concurrent updates to
the same API object are not common.
Other Deployer Impact
---------------------
N/A - upgrade path will maintain normal N-1 backward compatibility on the
server so all of the current RPC endpoints will be left untouched for one
cycle.
Developer Impact
----------------
Need to change development guidelines to avoid the implementation of new
direct server calls.
The notifications will have to send out oslo versioned objects since
notifications don't have RPC versions. So at a minimum we need to
switch to oslo versioned objects in the notification code if we
can't get them fully implemented everywhere else. To do this we
can leverage the RPC callbacks mechanism.
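
As a sketch of what a versioned notification payload might look like using
oslo.versionedobjects directly (the class name and field set here are
illustrative, not the final object model)::

  from oslo_versionedobjects import base as ovo_base
  from oslo_versionedobjects import fields as ovo_fields


  @ovo_base.VersionedObjectRegistry.register
  class PortPayload(ovo_base.VersionedObject):
      # bump VERSION whenever the fields change; this is what replaces
      # an RPC version for data carried inside notifications
      VERSION = '1.0'

      fields = {
          'id': ovo_fields.UUIDField(),
          'revision_number': ovo_fields.IntegerField(),
          'status': ovo_fields.StringField(nullable=True),
      }

The VERSION string on the payload is what the grenade partial job in the work
items below would exercise to verify N-1 compatibility between server and
agents.
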
Alternatives
------------
Maintain the current information retrieval pattern and just adjust the timeout
mechanism for everything to include back-offs or use cast/cast instead of
calls. This will allow a system to automatically recover from self-induced
death by stampede, but it will not make the performance any more predictable.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
kevinbenton
Ihar Hrachyshka
Work Items
----------
* Exponential back-off for timeouts on agents
* Implement 'revision' extension to add the revision_number column to the
data-model and expose it as a standard attribute.
* Write tests to ensure revisions are incremented as expected
* Write (at least one) test that verifies a StaleDataError is triggered
in the event of concurrent updates.
* Update the DHCP agent to make use of this new 'revision' field to discard
stale updates. This will be used as the proof of concept for this approach,
since the DHCP agent is currently exposed to operating on stale data from
out-of-order messages.
* Replace the use of sync_routers calls on the L3 agents for the most frequent
operations (e.g. floating IP associations) with RPC callbacks once the
OVO work allows it.
* Stand up grenade partial job to make sure agents using different OVO versions
maintain N-1 compatibility
* Update devref for callbacks
Possible Future Work
--------------------
* Switch to cast/cast pattern so agent isn't blocked waiting on server
* Set up a periodic system based on these revision numbers to have the agents
figure out if they have lost updates from the server (e.g. periodic
broadcasts of revision numbers and UUIDs, sums of collections of revisions,
etc.).
* Add an 'RPC pain multiplier' option that just causes all calls to the
neutron server to be duplicated X number of times. That way we can set
it to something like 200 for the gate which will force us to make every
call reasonably performant.
* Allow the HTTP API to perform compare-and-swap updates by placing an
If-Match header with the revision number, which would cause the update to
fail if the version changed (see the sketch after this list).
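
A sketch of how such a conditional update might look from an API client; the
exact header format, endpoint, and placeholder values are assumptions for
illustration only::

  import requests

  # placeholder values for illustration
  NEUTRON_URL = 'http://controller:9696'
  TOKEN = 'an auth token'
  PORT_ID = 'a port uuid'
  KNOWN_REVISION = 42

  resp = requests.put(
      '%s/v2.0/ports/%s' % (NEUTRON_URL, PORT_ID),
      headers={'X-Auth-Token': TOKEN,
               # ask the server to reject the update if the revision
               # number has moved on since we last read the port
               'If-Match': 'revision_number=%d' % KNOWN_REVISION},
      json={'port': {'admin_state_up': False}})
  # a 412 Precondition Failed here would mean another writer got in first
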
Testing
=======
* The grenade partial job will be important to ensure we maintain our N-1
backward compatibility with agents from the previous release.
* API tests will be added to ensure the basic operation of the revision numbers
* Functional and unit tests to test the agent reactions to payloads
Documentation Impact
====================
User Documentation
------------------
N/A
Developer Documentation
-----------------------
Devref guidelines on the pattern for getting information to agents and what
the acceptability criteria are for calls to the server.
RPC callbacks devref will need to be updated with notification strategy.
References
==========
1. http://git.openstack.org/cgit/openstack/neutron/tree/doc/source/devref/rpc_callbacks.rst
2. https://www.rabbitmq.com/semantics.html#ordering