diff --git a/specs/newton/push-notifications.rst b/specs/newton/push-notifications.rst
new file mode 100644
index 000000000..757e22c9a
--- /dev/null
+++ b/specs/newton/push-notifications.rst
@@ -0,0 +1,335 @@
..
  This work is licensed under a Creative Commons Attribution 3.0 Unported
  License.

  http://creativecommons.org/licenses/by/3.0/legalcode

=============================
Push Notifications for Agents
=============================

RFE:
https://bugs.launchpad.net/neutron/+bug/1516195

Launchpad blueprint:
https://blueprints.launchpad.net/neutron/+spec/push-notifications

The current method we use to get information from the server to the agents
is driven by notifications and error-triggered calls from the agents to the
server. During normal operation, the server sends out a notification that a
specific object has changed (e.g. a port) and the agent responds by querying
the server for the details of that port. If the agent encounters a failure
while processing changes, it starts over and re-queries the server in the
process.

The load this agent-driven approach places on the server can be very
unpredictable because it depends on the changes to object states on the
neutron server. For example, a single network update will result in a query
from every L2 agent with a port on that network.

This blueprint aims to change the pattern we use to get information to the
agents so that it is primarily based on pushing object state out in the
change notifications themselves. For anything that cannot leverage this
retrieval method (e.g. initial agent startup still needs to poll), the AMQP
timeout handling will be fixed to use an exponential back-off that prevents
the agents from stampeding the server.


Problem Description
===================

An outage of a few agents and their recovery can lead to all of the agents
drowning the neutron servers with requests.
This can cause the neutron servers
to fail to respond in time, which results in more retry requests building up,
leaving the entire system useless until an operator intervenes.

This is caused by three problems:

* We don't make optimal use of server notifications. There are times when
  the server will send a notification to an agent to inform it that something
  has changed, and the agent then has to call back to the server to get the
  relevant details. This means a single L3 rescheduling event for a set of
  routers due to a failed L3 agent can result in N more calls to the server,
  where N is the number of routers. Compounding this issue, a single agent
  may make multiple calls to the server for a single operation (e.g. the L2
  agent will make one call for port info and then another for security group
  info).

* The agents give up on a request after a short period of time and then
  retry it or issue an even more expensive request (e.g. if synchronizing
  info for one item fails, a major issue is assumed, so a request to sync
  all items is issued). So by the time the server finishes fulfilling a
  request, the client is no longer waiting for the response and it is simply
  discarded. As this compounds, it leaves the server processing a massive
  queue of requests that no longer have listeners for their responses.

* Related to the second item is the fact that the agents are aggressive in
  their retry mechanisms. If a request times out, it is immediately retried
  with the same timeout value; that is, they have no back-off mechanism.
  (This has now been addressed by
  https://review.openstack.org/#/c/280595/ which adds back-off,
  sleep, and jitter.)


Proposed Change
===============

Eliminate the expensive cases where calls are made to the neutron server in
response to a notification generated by the server. In most of these cases,
where the agent is just asking for regular neutron objects
(e.g.
ports, networks), we can leverage the RPC callbacks mechanism
introduced in Liberty [1] to have the server send the entire changed object
as part of the notification so the agent has the information it needs.

The main targets for this will be the security group info call,
the get_device_details call, and the sync_routers call. Others will be
included if the change is trivial once these three are done.
The DHCP agent already relies on push notifications, so it will just
be updated to use the revision number to detect the out-of-order events
it is currently susceptible to.

For the remaining calls that cannot easily be converted to the callbacks
mechanism (e.g. the security groups call, which blends several objects, the
initial synchronization mechanism, and agent-generated calls), a more
forgiving timeout mechanism will be implemented with an exponential
back-off and timeout increase so a heavily loaded server is not hammered
continuously.


Changes to RPC callback mechanism
---------------------------------

The current issue with the RPC callback mechanism and sending objects as
notifications is the lack of both server operation ordering guarantees and
AMQP message ordering guarantees.

To illustrate the first issue, examine the following order of events that
can happen when two servers update the same port:

* Server 1 commits update to DB
* Server 2 commits update to DB
* Server 2 sends notification
* Server 1 sends notification

If the agent receives the notifications in the order in which they are
delivered to AMQP, it will think the state delivered by Server 1 is the
current state when it is actually the state committed by Server 2.

We have the same issue when oslo messaging doesn't guarantee message order
(e.g. ZeroMQ). Even if Server 1 sends immediately after its commit and
before Server 2 commits and sends, one or more of the agents could end up
seeing Server 2's message before Server 1's.
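The agent-side guard this spec proposes (a per-object revision counter plus a set of permanently deleted UUIDs, both introduced below) can be sketched as follows. This is a minimal illustration with hypothetical names, not actual neutron agent code:

```python
# Illustrative agent-side staleness filter. Assumes every pushed object
# carries a monotonically increasing 'revision_number'; the class and field
# names here are hypothetical, not actual neutron agent code.


class ResourceCache:
    def __init__(self):
        self._objects = {}     # uuid -> latest object state seen
        self._deleted = set()  # uuids that are gone for good

    def record_update(self, obj):
        """Apply a pushed object state; return True if it was applied."""
        uuid = obj['id']
        if uuid in self._deleted:
            # The update raced with a delete; UUIDs are never re-used,
            # so the delete always wins.
            return False
        current = self._objects.get(uuid)
        if current and current['revision_number'] >= obj['revision_number']:
            return False  # stale message delivered out of order
        self._objects[uuid] = obj
        return True

    def record_delete(self, uuid):
        self._deleted.add(uuid)
        self._objects.pop(uuid, None)


cache = ResourceCache()
cache.record_update({'id': 'p1', 'revision_number': 2, 'status': 'ACTIVE'})
# An older state arriving late is recognized as stale and ignored:
applied = cache.record_update({'id': 'p1', 'revision_number': 1,
                               'status': 'DOWN'})
```

Here `record_update` returning False simply drops the message; the agent keeps wiring up the dataplane from the newest state it has.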
To handle this, we will add a revision number, implemented as a monotonic
counter, to each object. This counter will be incremented on any update so
any agent can immediately identify stale messages.

To address deletes arriving before updates, agents will be expected to keep
a set of the UUIDs that have been deleted. Upon receiving an update, the
agent will check this set for the object's UUID and, if it is present,
ignore the update, since deletes are permanent and UUIDs cannot be re-used.
If we do make IDs recyclable in the future, this can be replaced with a
strategy to confirm ID existence with the server, or we can add another
internal UUID that cannot be specified.

Note that this doesn't guarantee message ordering for the agent, because
that is a property of the message backend, but it does give the agent the
information necessary to re-order messages as it receives them, so it can
determine which one reflects the more recent state of the DB.


Data Model Impact
-----------------

A 'revision_number' column will be added to the standard attr table. This
column will be a simple big integer used as a monotonic counter that is
updated whenever the object is updated on the neutron server. This revision
number can then be used by the agents to automatically discard any object
states that are older than the state they already have.

This revision_number will use the version counter feature built into
SQLAlchemy: http://docs.sqlalchemy.org/en/latest/orm/versioning.html
Each time an object is updated, the server will perform a compare-and-swap
operation based on the revision number. This ensures that each update must
start from the current revision number or fail with a StaleDataError. The
API layer can catch this error with the current DB retry mechanism and
start over with the latest revision number.

While SQLAlchemy will automatically bump the revision for us when the record
for an object is updated (e.g.
a standard attr description field), it will
not do so when a related object changes (e.g. adding an IP address to the
port or changing its status). So we will have to manually trigger the
revision bump (either via a PRECOMMIT callback or inline code) for any
operation that should bump the revision number.

What this guarantees:

- An object in a notification is newer (from a DB state perspective) than
  any copy of that object with a lower revision number, so objects with
  lower revision numbers can safely be ignored since they represent stale
  DB state.

What this doesn't guarantee:

- Message ordering 'on the wire'. An AMQP listener may end up receiving an
  older state than a message it has already received. It's up to the
  listener to look at the revision number to determine if the message is
  stale.
- That each intermediate state is transmitted. If a notification mechanism
  reads the DB to get the full object to send, the DB state may have
  progressed, so it will notify with the latest state rather than the state
  that triggered the original notification. This is acceptable for all of
  our use cases since we only care about the current state of the object to
  wire up the dataplane. It is also effectively what we have now, since the
  DB state could change between when the agent gets a notification and when
  it actually asks the server for details.
- Reliability of the notifications themselves. This doesn't address the
  issue we currently have where a dropped notification goes undetected.


Notifications Impact
--------------------

Existing notifications will become significantly more data-rich. The hope
is to eliminate many of the expensive RPC calls that each agent makes and
have each agent derive all state from notifications, with one sync method
for recovery/initialization that we can focus on optimizing.
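The compare-and-swap described under Data Model Impact above can be exercised with SQLAlchemy's built-in version counter. The sketch below is illustrative only: the model and column names are simplified stand-ins for the real neutron schema, and SQLAlchemy 1.4+ with an in-memory SQLite database is assumed.

```python
# Sketch of the revision-number compare-and-swap using SQLAlchemy's
# version counter. Names are illustrative, not the real neutron models.
from sqlalchemy import Column, Integer, String, create_engine, text
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.orm.exc import StaleDataError

Base = declarative_base()


class StandardAttr(Base):
    __tablename__ = 'standardattributes'
    id = Column(Integer, primary_key=True)
    description = Column(String(255))
    revision_number = Column(Integer, nullable=False)
    # SQLAlchemy adds "AND revision_number = :known" to every UPDATE and
    # bumps the counter; losing the race raises StaleDataError.
    __mapper_args__ = {'version_id_col': revision_number}


engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine, expire_on_commit=False)

session = Session()
obj = StandardAttr(id=1, description='initial')
session.add(obj)
session.commit()  # revision_number is now 1

# Simulate a concurrent writer bumping the row behind this session's back.
with engine.begin() as conn:
    conn.execute(text("UPDATE standardattributes "
                      "SET description='concurrent', revision_number=2 "
                      "WHERE id=1"))

obj.description = 'stale update'  # still based on revision 1
try:
    session.commit()  # UPDATE ... WHERE revision_number=1 matches nothing
    stale_detected = False
except StaleDataError:
    session.rollback()  # the API retry layer would restart from fresh state
    stale_detected = True
```

In neutron itself the StaleDataError would be caught by the existing DB retry mechanism at the API layer, which re-reads the object and replays the update with the latest revision number.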
This will result in more data being sent up front by the server to the
messaging layer, but it eliminates the data that would be sent in response
to a call request from the agent under the current pattern. For a single
agent, the only gain is the elimination of the notification and call
messages; but for multiple agents interested in the same resource, it
eliminates extra DB calls and extra messages from the server to fulfill
those calls.

This pattern will also result in fewer messages sent through oslo
messaging: the calls from the agents are eliminated, and the payload that
would have been cast separately to each requesting agent is instead
broadcast preemptively once.


Performance Impact
------------------

A higher ratio of neutron agents per server, afforded by a large reduction
in sporadic queries from the agents.

This comes at the cost of effectively serializing operations on an
individual object due to the compare-and-swap operation on the server. For
example, if two server threads try to update a single object concurrently
and both read the current state of the object at the same time, one will
fail on commit with a StaleDataError, which will be retried by the API
layer. Previously both of these would succeed because the UPDATE statement
had no compare-and-swap WHERE criteria. However, this is a very reasonable
performance cost to pay considering that concurrent updates to the same API
object are not common.


Other Deployer Impact
---------------------

N/A - the upgrade path will maintain normal N-1 backward compatibility on
the server, so all of the current RPC endpoints will be left untouched for
one cycle.


Developer Impact
----------------

Development guidelines will need to change to discourage the implementation
of new direct server calls.

The notifications will have to send out oslo versioned objects since
notifications don't have RPC versions.
So at a minimum we need to
switch to oslo versioned objects in the notification code if we
can't get them fully implemented everywhere else. To do this we
can leverage the RPC callbacks mechanism.


Alternatives
------------

Maintain the current information retrieval pattern and just adjust the
timeout mechanism for everything to include back-offs, or use cast/cast
instead of calls. This will allow a system to automatically recover from
self-induced death by stampede, but it will not make the performance any
more predictable.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  kevinbenton
  Ihar Hrachyshka


Work Items
----------

* Exponential back-off for timeouts on agents.
* Implement a 'revision' extension to add the revision_number column to the
  data model and expose it as a standard attribute.
* Write tests to ensure revisions are incremented as expected.
* Write (at least one) test that verifies a StaleDataError is triggered
  in the event of concurrent updates.
* Update the DHCP agent to make use of the new 'revision' field to discard
  stale updates. This will be used as the proof of concept for this approach
  since the DHCP agent is currently exposed to operating on stale data with
  out-of-order messages.
* Replace the use of sync_routers calls on the L3 agents for the most
  frequent operations (e.g. floating IP associations) with RPC callbacks
  once the OVO work allows it.
* Stand up a grenade partial job to make sure agents using different OVO
  versions maintain N-1 compatibility.
* Update the devref for callbacks.


Possible Future Work
--------------------
* Switch to a cast/cast pattern so the agent isn't blocked waiting on the
  server.
* Set up a periodic system based on these revision numbers to have the
  agents figure out if they have lost updates from the server (e.g. periodic
  broadcasts of revision numbers and UUIDs, sums of collections of
  revisions, etc.).
* Add an 'RPC pain multiplier' option that simply causes all calls to the
  neutron server to be duplicated X times. That way we can set it to
  something like 200 for the gate, which will force us to make every call
  reasonably performant.
* Allow the HTTP API to perform compare-and-swap updates by placing an
  if-match header with the revision number, which would cause the update to
  fail if the version changed.


Testing
=======

* The grenade partial job will be important to ensure we maintain our N-1
  backward compatibility with agents from the previous release.
* API tests will be added to ensure the basic operation of the revision
  numbers.
* Functional and unit tests to test the agent reactions to payloads.


Documentation Impact
====================


User Documentation
------------------

N/A


Developer Documentation
-----------------------

Devref guidelines on the pattern for getting information to agents and what
the acceptance criteria are for calls to the server.

The RPC callbacks devref will need to be updated with the notification
strategy.

References
==========

1. http://git.openstack.org/cgit/openstack/neutron/tree/doc/source/devref/rpc_callbacks.rst
2. https://www.rabbitmq.com/semantics.html#ordering