Currently, the dhcp Provisioning of ports is the crucial bottleneck
of that concurrently boot multiple VM.
The root cause is that these ports will be processed one by one by dhcp
agent when they belong to the same network, And the 'Provisioning complete'
port is still blocked other port's processing in other dhcp agents. The
patch aim to optimize the dispatch strategy of the port cast to agent to
improve the Provisioning process.
In server side, I classify messages to multi levels. Especially, I classify
the port_update_end or port_create_end message to two levels, the high-level
message only cast to one agent, the low-level message cast to all agent. In
agent side I put these messages to `resource_processing_queue`, with the queue,
We can delete `_net_lock` and process these messages in order of priority.
Additonally, I modified the `resource_processing_queue` for my demand. I update
`_queue` from LIST to PriorityQueue in `ExclusiveResourceProcessor`, by this
way, we can sort all message which cached in `ExclusiveResourceProcessor` by
The port delete events are not synchronized with network rpc events. This
creates a condition which makes it possible for a port delete event to be
processed just before a previously started network query completes.
The problematic order of operations is as follows:
1) a network is scheduled to an agent; a network rpc is sent to the
2) the agent queries the network data from the server
3) while that query is in progress a port on that network is deleted; a
port rpc is sent to the agent
4) that port delete rpc is received before the network query rpc
5) the port delete results in no action because the port was not present
on the agent
6) the network query finishes and adds the port to the cache (even
though the port has already been deleted)
7) some time passes and a new port is configured with the same IP
address as the port that was deleted in (3)
8) the dhcp host file is corrupted with 2 entries for the same IP
9) dhcp queries for the newest port is rejected because of the duplicate
entry in the dhcp host file.
The solution is to add the network_id to the port_delete_end rpc event
so that the _net_lock(network_id) synchronization point can be acquired
so that it is processed serially with other network related events.
To ensure backwards compatibility with newer agents running against older
servers the determination of which network_id value to use in the lock is
handled using a utility that will fallback to the previous mode of operation
whenever the network_id attribute is not present in the *_delete_end RPC
events. That utility can be removed in the future when it is guaranteed
that the network_id attribute will be present in RPC messages from the
Signed-off-by: Allain Legacy <firstname.lastname@example.org>
The neutron.common.topics module was rehomed into neutron-lib with
This patch consumes it by removing the rehomed module from neutron
and using the module from neutron-lib instead.
The is_extension_supported function now lives in neutron-lib. This patch
removes the function from neutron and uses lib's version instead.
This patch switches callbacks over to the payload object style events
 for BEFORE_RESPONSE and AFTER_REQUEST based notifications. To do
so an APIEventPayload object is used with the publish() method to
pass along the API related data. In addition a few UTs are updated to
work with the changes.
The neutron-lib commit I360545b6ee4291547e0c5c8e668ad03d3efa4725 moved
the externally consumed globals from neutron.common.constants into lib.
With the exception of PROVISIONAL_IPV6_PD_PREFIX all other constants
in neutron.common.constants should only be used in neutron, and will
hopefully remain that way. External consumers needing access to other
common constants should move them into lib first.
Changing rpc_api.rst file path from doc/source/devref/rpc_api.rst
to /doc/source/contributor/internals/rpc_api.rst. Because rpc_api.rst
file is located at this path
Since Pike log messages should not be translated.
This patch removes calls to i18n _LC, _LI, _LE, _LW from
logging logic throughout the code. Translators definition
from neutron._i18n is removed as well.
This patch also removes log translation verification from
ignore directive in tox.ini.
DVR router updates are notified based on where they are hosted.
But in the case of metering rpc notifier the notification to a
specific host was not implemented and also when a call is being
made to routers_updated_on_host, it is being ignored since it only
implements the routers_updated function call.
The well known service type constants are in
neutron_lib.plugins.constants, but for legacy reasons a few still exist
and are referenced from neutron_lib.constants that we'd like to remove.
This patch switches references over to neutron_lib's plugin constants.
The callback modules have been available in neutron-lib since commit 
and are ready for consumption.
As the callback registry is implemented with a singleton manager
instance, sync complications can arise ensuring all consumers switch to
lib's implementation at the same time. Therefore this consumption has
been broken down:
1) Shim neutron's callbacks using lib's callback system and remove
existing neutron internals related to callbacks (devref, UTs, etc.).
2) Switch all neutron's callback imports over to neutron-lib's.
3) Have all sub-projects using callbacks move their imports over to use
neutron-lib's callbacks implementation.
4) Remove the callback shims in neutron-lib once sub-projects are moved
over to lib's callbacks.
5) Follow-on patches moving our existing uses of callbacks to the new
event payload model provided by neutron-lib.callback.events
This patch implements #2 from above, moving all neutron's callback
imports to use neutron-lib's callbacks.
There are also a few places in the UT code that still patch callbacks,
we can address those in step #4 which may need .
On profiling the get_devices_details communications between
the agent and the server, a significant amount of time
(60% in my dev env) is being spent in the AFTER_UPDATE events
for the port updates resulting from the port status changes.
One of the major offenders is the native DHCP agent notifier.
On each port update it ends up retrieving the network for the
port, the DHCP agents for the network, and the segments.
This patch addresses this particular issue by adding logic to
skip a DHCP notification if the only thing that changed on the
port was the status. The DHCP agent doesn't do anything based on
the status field so there is no need to update it when this is
the only change.
Neutron Manager is loaded at the very startup of the neutron
server process and with it plugins are loaded and stored for
lookup purposes as their references are widely used across the
entire neutron codebase.
Rather than holding these references directly in NeutronManager
this patch refactors the code so that these references are held
by a plugin directory.
This allows subprojects and other parts of the Neutron codebase
to use the directory in lieu of the manager. The result is a
leaner, cleaner, and more decoupled code.
Usage pattern [1,2] can be translated to [3,4] respectively.
The more entangled part is in the neutron unit tests, where the
use of the manager can be simplified as mocking is typically
replaced by a call to the directory add_plugin() method. This is
safe as each test case gets its own copy of the plugin directory.
That said, unit tests that look more like API tests and that rely on
the entire plugin machinery, need some tweaking to avoid stumbling
into plugin loading failures.
Due to the massive use of the manager, deprecation warnings are
considered impractical as they cause logs to bloat out of proportion.
Follow-up patches that show how to adopt the directory in neutron
subprojects are tagged with topic:plugin-directory.
Partially-implements: blueprint neutron-lib
This makes the notifier subscribe to core resource events
and leverage them if they are available. This solves the
issue where internal core plugin calls from service plugins
were not generating DHCP agent notifications.
Sending arp update to each l3 dvr agent one by one on every port
creation is not scalable and causes serious performance degradation
if router is hosted on lots of l3 dvr agents on compute nodes (see
bug report). This increases port creation time and eventually leads
to timeouts in Nova and VMs going to ERROR state.
This patch changes notification to be fanout.
The downside is that with fanout the arp notification will be sent to
each l3 agent, even those not hosting the router. However such agents
will just skip the notification if not hosting the router - this should
be quite cheap.
Bug 1591766 unveiled an issue where calling the plugin API does not trigger
DHCP notifications. This is required by the auto-allocated-topology service
plugin that calls core_plugin.update_network(), and expect notifications
to be sent out on state changes. To accomplish this, the logic has been
encapsulated in the DHCP module, and leveraged via callback mechanisms.
For this reason, new events have been introduced, AFTER_REQUEST, and
BEFORE_RESPONSE. The latter in particular is the one needed to hook up
dhcp notifications in order to preserve backward compatibility.
More precisely, core plugins that use DHCP as is or implement their own,
(with or without an agent) should already instantiate their own notifier,
and if they do not, this should be rectified.
A search on codesearch.openstack.org reveals that out-of-tree plugins
already specify their own notifiers, and the default initialization is
clearly redundant now.
Since a network can now be broken up by segments, each segment will
need to have its own DHCP Agent. This maintains backwards
compatibility when a network does not have a segment. However,
once a segment is created on a network, a dhcp agent should be
scheduled per segment with a dhcp enabled subnet.
The scheduling happens by filtering the candidate dhcp agents by
the hosts that are bound to that segment.
Partially-Implements: blueprint routed-networks
This patch introduces retry(func, max_attempts) method
which wraps the original func in such a way,
that if execution results in MessagingException,
the given function will be retried for max_attempts
till success or MessagingException is raised.
The function is in the utils module and can be reused
by different agentnotifiers if necessary.
Once the spinout is undergoing we should perform the eviction.
Partially-implements: blueprint bgp-spinout
In Dnsmasq, the function get_isolated_subnets() returns a list of
subnets in a network if the subnet is not connected to a router.
The implementation of this function checks all the router interface
ports in a cached network object passed from DHCP agent. But the
cached network object is not updated when a subnet is attached to
or detached from a router.
This patch fixes that by adding callback functions in DHCP RPC client
to notify DHCP agent when changes happen on router interfaces.
This patch implements a new agent named "BgpDrAgent". The new agent
will host different BGP speaking drivers and makes the required BGP
peering session/s for neutron. The agent takes the needed "peer/s and
route/s" information from the BGP speaker entity and synchronize the
same to the registerd driver.
For realizing HA, two BgpDrAgents should host the same BGP speaker.
Partially-Implements: blueprint bgp-dynamic-routing
Co-Authored-By: Ryan Tidwell <email@example.com>
Co-Authored-By: Jaume Devesa <firstname.lastname@example.org>
Co-Authored-By: Numan Siddique <email@example.com>
When we remove explicit binding of dvr routers to compute nodes
we'll need a way to know all hosts where a dvr router should be
hosted in order to send notifications.
This patch adds such a query and updates l3 rpc notifier to use it.
Partially implements blueprint improve-dvr-l3-agent-binding
- This does NOT break other projects that rely on neutron.i18n,
as this change includes a debtcollector shim to maintain those
older entry points, until they can migrate.
- Also updates _i18n.py to the latest pattern defined by oslo_i18n
- Guidance and template are from the reference:
Currently router_added (and other) notifications are sent
to agents with an RPC cast() method which does not ensure that
the message is actually delivered to the recipient.
If the message is lost (due to instability of messaging system
in some failover scenarios for example) neither server nor agent
will be aware of that and router will be "lost" till next agent
resync. Resync will only happen in case of errors on agent side
The fix makes server use call() to notify agents about added routers
thus ensuring no routers will be lost.
This also unifies reschedule_router() method to avoid code duplication
between legacy and dvr agent schedulers.
The fipnamespace is associated with an external network
on a given node. In the case of DVR there is just one
single FIP namespace for a given node.
We have seen some race conditions in the agent for creation
and deletion of the fip namespace. See the bug report for
details on the failure.
So in order to address this race condition and make the
code more stable, we will be cleaning up the fip namespace
only when an external network is removed.
The server will be sending a rpc notification message to
the agent to cleanup the fip namespace when the external
net is removed.
This patch address the above mentioned issue by not constantly
deleting and creating the fip namespace.
Currently when floating ip is created, a lot of useless action
is happening: floating ip router is scheduled, all l3 agents where
router is scheduled are notified about router update, all agents
request full router info from server. All this becomes a big
performance problem at scale with lots of compute nodes.
In fact on (associated) Floating IP creation we really need
to notify specific l3 agent on compute node where associated
VM port is located and do not need to schedule router and
bother other agents where rourter is scheduled. This should
significally decrease unneeded load on neutron server at scale.
This patch is to address the failure of manual move of
dvr_snat routers from one service node to another.
The entry in the csnat_l3_agent_bindings table is now removed
during the router to agent unbind operation.
Appropriate notification is now sent to the agent to remove
There were other places in the code
that needed to examine the snat binding table to
check if updates were required -
Additionally, schedule_routers() was made optional
within the rpc _notification path since it can
override the manual move being attempted.
Co-Authored-By: Ila Palanisamy <firstname.lastname@example.org>
Co-Authored-By: Oleg Bondarev <email@example.com>
This cannot be done in Python 3, where dict.keys() returns an iterator. We need
to cast the result of dict.keys() to a list first.
Previously when admin_state_up of an agent is turned to False,
all services on it will be disabled.
This fix makes existing services on agents with admin_state_up
False keep available.
To keep current behavior available the following configuration
If the parameter is True, existing services on agents with admin_state_up
False keep available. No more service will be scheduled to the agent
automatically. But adding a service to the agent manually is available.
i.e. admin_state_up: False means to stop automatic scheduling under the
parameter is True.
The default of the parameter is False (current behavior).
Now we send all labels and rules per rule create/delete
and rebuild whole iptables chains.
In this patch we send only affected rule and create/
delete only this rule from iptables.
Change the DHCP notifier behavior to schedule a network
to a DHCP agent when a subnet is created rather than
waiting for the first port to be created.
This will reduce the possibility to get a VM port created
and have it send a DHCP request before the DHCP agent is
ready. Before, the network would be scheduled to an agent
as a result of the API call to create the VM port, so the
DHCP port wouldn't be created until after the VM port.
After this patch, the network will have been scheduled to
a DHCP agent before the first VM port is created.
There is still a possibility that the DHCP agent could be
responding so slowly that it doesn't create its port and
activate the dnsmasq instance before the VM sends traffic.
A proper fix will ensure that the dnsmasq instance is
truly ready to serve requests for a new port will require
significantly more code for barriers (either on the subnet
creation, port creation, or the nova boot process) are too
complex to add this late in the cycle.
This patch also eliminates the logic in the n1kv plugin that
was already doing the same thing.
It's mostly a matter of changing imports to a new location.
Non-obvious changes needed:
* pass overwrite= argument to oslo_context since oslo.log reads context
from its thread local store and not local.store from incubator
* don't store context at local.store now that there is no code that
would consume it
* LOG.deprecated() -> versionutils.report_deprecated_feature()
* dropped LOG.audit check from hacking rule since now the method does
* WritableLogger is now located in oslo_log.loggers
Dropped log module from the tree. Also dropped local module that is now
of no use (and obsolete, as per oslo team).
Added versionutils back to openstack-common.conf since now we use the
module directly from neutron code and not just as a dependency of some
other oslo-incubator module.
Note: tempest tests are expected to be broken now, so instead of fixing
all the oslo.log related issues for the subtree in this patch, I only
added TODOs with directions for later fix.
There is an rpc interface defined for the Neutron plugin to be able to
execute methods in the DHCP agent. Provide docstring pointers in the
client and server side that tells you where to find the other side of
No namespace usage is needed here. This API is the only one exposed
via the DHCP agent, so the default namespace used now is fine.
The DhcpAgent class was updated to explicitly define the
messaging.Target(). Previously it was using the equivalent one
defined in the Manager base class. Having it specified here makes it
more obvious that this is an rpc endpoint, and also provides the
obvious place that must have the version updated if the interface is
Part of blueprint rpc-docs-and-namespaces.
Remove usage of the RpcProxy compatibility class from the
DhcpAgentNotifyAPI class. The equivalent oslo.messaging APIs are now
Part of blueprint drop-rpc-compat.