When the L3 agent gets a router update notification, it tries to
retrieve the router info from the neutron server. If the message
queue is down or unreachable at that moment, it gets message-queue
related exceptions and runs the resync actions. A RabbitMQ cluster is
not always easy to recover quickly, and a long MQ recovery time means
the router info sync RPC never succeeds before hitting the max retry
count. Then the bad thing happens: the L3 agent tries to remove the
router, which basically shuts down all the existing L3 traffic of
this router.
This patch removes that final router removal action and lets the
router keep running as it is.
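A minimal sketch of the changed behaviour, assuming the shape of the
agent's update loop (hit_retry_limit() and the resync helper exist
upstream; the body here is illustrative, not the actual diff):

    import logging

    LOG = logging.getLogger(__name__)

    # Illustrative sketch: the retry-limit branch no longer removes
    # the router.
    def process_router_update(agent, update):
        try:
            routers = agent.plugin_rpc.get_routers(agent.context,
                                                   [update.id])
        except Exception:
            # MQ is down/unreachable: re-enqueue and try again later.
            if update.hit_retry_limit():
                # Previously the agent removed the router here, which
                # shut down all of its existing L3 traffic.
                LOG.warning("Hit retry limit for router %s update, "
                            "leaving the router as it is", update.id)
                return
            agent.resync_router(update)
            return
        agent.process_routers(routers, update)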
Closes-Bug: #1871850
Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
It seems that is_enabled_and_bind_by_default() from
neutron.common.ipv6_utils was copied directly into
oslo_utils.netutils, so start using that version instead.
Trivialfix
Change-Id: I00fa441e7a20fcd1115485bb8ab75750e6a8cf07
The L3 agent has supported multiple external networks for a long
time, so remove this RPC call since it is no longer used.
According to the codesearch results [1] and [2], we only remove the
neutron built-in L3 agent RPC; on the neutron server side and in the
RPC callback classes, the function still remains.
[1] http://codesearch.openstack.org/?q=get_external_network_id
[2] http://codesearch.openstack.org/?q=L3RpcCallback
Change-Id: I764423e175d6e82729a647e415a9f267f495916f
Closes-Bug: #1844168
The fetch_and_sync_all_routers method uses Python's range() function,
which accepts only integers. This patch fixes the division behaviour
under Python 3, where the result is a float, by casting it to int, as
it was represented in Python 2.
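The difference in one line (the chunk computation below is a
stand-in, not the exact code):

    num_routers, chunks = 15, 4
    # Python 2: 15 / 4 == 3 (int). Python 3: 15 / 4 == 3.75 (float),
    # which range() rejects, so cast the result back to int.
    for i in range(int(num_routers / chunks)):
        print(i)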
Change-Id: Ifffdee0d4a3226d4871cfabd0bdbf13d7058a83e
Closes-Bug: #1824334
As described in the bug, when an HA router transitions from "master"
to "backup", the "keepalived" processes will set the virtual IP on
all the other HA routers. Each HA router will then advertise it, and
"keepalived" will decide, according to a trivial algorithm (highest
interface IP), which one should be "master". At that point, the
"keepalived" processes running on the other servers will remove the
HA router virtual IP they assigned an instant before.
To avoid transitioning some routers from "backup" to "master" and
then back to "backup" in a very short period, this patch delays the
"backup" to "master" transition, waiting for a possible new "backup"
state. If, during the waiting period (set to the HA VRRP advert
time, 2 seconds by default) before setting the HA state to "master",
the L3 agent receives a new "backup" HA state, the L3 agent does
nothing.
Closes-Bug: #1837635
Change-Id: I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad
In order to support out-of-tree interface drivers, it is required
to pass a callback to allow the drivers to query information about
the network.
- Allow passing **kwargs to interface drivers
- Pass get_networks() as the `get_networks_cb` kwarg
`get_networks_cb` has the same API as
`neutron.neutron_plugin_base_v2.NeutronPluginBaseV2.get_networks()`
minus the request context, which is embedded in the callback itself.
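A sketch of the wiring, where the partial application of the request
context and the driver-loading helper are assumptions based on the
description above:

    import functools

    def load_interface_driver(conf, plugin, admin_context):
        # Bind the context into the callback so drivers call it with
        # just filters/fields, mirroring get_networks() minus context.
        get_networks_cb = functools.partial(plugin.get_networks,
                                            admin_context)
        # _import_driver_class is a hypothetical stand-in helper.
        driver_cls = _import_driver_class(conf.interface_driver)
        return driver_cls(conf, get_networks_cb=get_networks_cb)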
The out-of-tree interface drivers in question are:
MultiInterfaceDriver - a per-physnet interface driver that delegates
operations on a per-physnet basis.
IPoIBInterfaceDriver - an interface driver for IPoIB (IP over Infiniband)
networks.
Those drivers are part of networking-mlnx [1]. Their implementation
is vendor agnostic, so they can later be moved to a more common place
if desired.
[1] https://github.com/openstack/networking-mlnx
Change-Id: I74d9f449fb24f64548b0f6db4d5562f7447efb25
Closes-Bug: #1834176
This option has been deprecated for a couple of releases already.
In Stein we removed the 'external_network_bridge' option from the L3
agent's config, so now it is time to remove this one as well.
A new upgrade check is also introduced to detect and warn users if
gateway_external_network_id was used in the deployment.
This patch also removes method _check_router_needs_rescheduling() from
neutron/db/l3_db.py module as it is not needed anymore.
This patch also removes unit tests:
test_update_gateway_agent_exists_supporting_network
test_update_gateway_agent_exists_supporting_multiple_network
test_router_update_gateway_no_eligible_l3_agent
from neutron/tests/unit/extensions/test_l3.py module as those
tests are not needed when there is no "gateway_external_network_id"
config option anymore.
Change-Id: Id01571cd42cfe9c5ce91e90159917c7d3c963878
- Added get_networks() RPC call for the DHCP agent
- Added get_networks() RPC call for the L3 agent
This change is required in order to support the out-of-tree
MultiInterfaceDriver and IPoIBInterfaceDriver interface drivers,
as they require information about the network a port is being
plugged into.
These RPCs will be passed as kwargs when loading the relevant
interface driver.
The get_networks() keyword args map to the keyword arguments of
neutron.neutron_plugin_base_v2.NeutronPluginBaseV2.get_networks().
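A minimal sketch of the agent-side stub, following the usual
oslo.messaging pattern (illustrative, not the exact upstream method):

    # Agent-side RPC stub; filters/fields mirror the keyword
    # arguments of NeutronPluginBaseV2.get_networks().
    def get_networks(self, context, filters=None, fields=None):
        cctxt = self.client.prepare()
        return cctxt.call(context, 'get_networks',
                          filters=filters, fields=fields)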
Change-Id: I11d82380aad8655a4fdc9656737b912b16e2859b
Partial-Bug: #1834176
And set it on all the L3 RPC functions. It can be moved to
neutron-lib if it becomes widely used across other projects.
Related-Bug: #1835663
Change-Id: Ie7743db097fd45df432af341470336d6a5662c6f
Change [1] removed the deprecated option external_network_bridge. Per
commit message in change [2], "l3 agent can handle any networks by
setting the neutron parameter external_network_bridge and
gateway_external_network_id to empty". So the consequence of [1] was to
introduce a regression whereby multiple external networks are not
supported by the L3 agent anymore.
This change proposes a new, simplified rule: if
gateway_external_network_id is defined, that is the network the L3
agent will use; if it is not, and multiple external networks exist,
the L3 agent will handle any of them.
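A minimal sketch of the rule (the method name follows the agent code;
the body is illustrative):

    def _fetch_external_net_id(self):
        # Explicitly configured: the agent serves only this network.
        if self.conf.gateway_external_network_id:
            return self.conf.gateway_external_network_id
        # Otherwise: no restriction, handle any external network.
        return None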
[1] https://review.opendev.org/#/c/567369/
[2] https://review.opendev.org/#/c/59359
Change-Id: Idd766bd069eda85ab6876a78b8b050ee5ab66cf6
Closes-Bug: #1824571
The RPC notifier method can sometimes be time-consuming, which
causes other parallel resource processing to fail to send
notifications in time. This patch makes the notification
asynchronous.
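A minimal sketch of the change, assuming the agent's eventlet-based
concurrency (illustrative):

    import eventlet

    def notify_routers_updated(notifier, context, router_ids):
        # Before: notifier.routers_updated(context, router_ids)
        # blocked the caller. Now fire it from a green thread.
        eventlet.spawn_n(notifier.routers_updated, context, router_ids)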
Closes-Bug: #1824911
Change-Id: I3f555a0c78fbc02d8214f12b62c37d140bc71da1
Currently, most implementations override the L3NatAgent class itself
for their own logic, since there is no proper interface to extend the
RouterInfo class. This adds unnecessary complexity for developers who
just want to extend the router mechanism instead of the whole RPC.
Add a RouterFactory class with which developers can register their
RouterInfo classes and delegate RouterInfo creation to it. Separate
the functions and variables that are currently used externally out of
RouterInfo into an abstract class, so that extensions can rely on
that basic interface.
Provide the router registration function to the l3 extension API so
that an extension can extend RouterInfo itself for each feature
combination (ha, distributed, ha + distributed).
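A minimal sketch of such a factory (names modelled on the description
above; the real interface is defined in the patch itself):

    class RouterFactory(object):
        """Illustrative registry mapping router features to
        RouterInfo classes so extensions can swap in their own."""

        def __init__(self):
            self._router_classes = {}

        def register(self, features, router_cls):
            # features: e.g. ('ha',), ('distributed',),
            # ('ha', 'distributed')
            self._router_classes[features] = router_cls

        def create(self, features, **kwargs):
            return self._router_classes[features](**kwargs)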
Depends-On: https://review.openstack.org/#/c/620348/
Closes-Bug: #1804634
Partially-Implements: blueprint openflow-based-dvr
Change-Id: I1eff726900a8e67596814ca9a5f392938f154d7b
Add a unique id to each resource update, so that we can calculate
and track the resource processing time.
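A minimal sketch of the tagging (illustrative; the real update object
carries more state):

    import time
    import uuid

    class ResourceUpdate(object):
        """Tag each update so its processing time can be tracked."""

        def __init__(self, resource_id):
            self.id = resource_id
            self.update_id = uuid.uuid4().hex  # unique per update
            self.create_time = time.time()

        def time_elapsed(self):
            # Log this with update_id to track processing time.
            return time.time() - self.create_time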
Related-Bug: #1825152
Related-Bug: #1824911
Related-Bug: #1821912
Related-Bug: #1813787
Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
All of the externally consumed variables from neutron.common.constants
now live in neutron-lib. This patch removes neutron.common.constants
and switches all uses over to lib.
NeutronLibImpact
Depends-On: https://review.openstack.org/#/c/647836/
Change-Id: I3c2f28ecd18996a1cee1ae3af399166defe9da87
This option was deprecated and marked for removal in Ocata. As we
are now in the Stein development cycle, it is a good time to remove
it.
Change-Id: I07474713206c218710544ad98c08caaa37dbf53a
Removing an active or a standby HA router from an agent that has a
valid DVR serviceable port (such as DHCP) does not remove the HA
interface associated with the router in the SNAT namespace.
When we try to add the HA router back to the agent, it adds more
than one HA interface to the SNAT namespace, causing further
problems, and we sometimes also see multiple active routers.
This bug might have been introduced by patch [1].
Fix the problem by adding the router namespaces without HA
interfaces when there is no HA, and re-inserting the HA interfaces
into the namespace when the HA router is bound to the agent.
[1] https://review.openstack.org/#/c/522362/
Closes-Bug: #1816698
Change-Id: Ie625abcb73f8185bb2bee06dcd26a01d8af0b0d1
I noticed in the functional logs that the l3-agent constantly logs
this message, even when just adding or removing a single router:
Resizing router processing queue green pool size to: 8
It is misleading, as the pool is not actually being resized (it is
still 8), so only log when we are really changing the pool size.
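A minimal sketch of the guard (illustrative; assumes an eventlet
GreenPool):

    import logging

    LOG = logging.getLogger(__name__)

    def resize_process_pool(pool, new_size):
        # Only resize (and log) when the size actually changes.
        if pool.size == new_size:
            return
        LOG.info("Resizing router processing queue green pool size "
                 "to: %d", new_size)
        pool.resize(new_size)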
Change-Id: I5dc42fa4b4c1964b7d027681b61550cd82e83234
If the l3-agent is restarted by a regular action, such as a config
change, package upgrade, manual service restart etc., we should not
set the HA port down, unless the physical host was rebooted, i.e.
all the VRRP processes were terminated.
This patch adds a new RPC call during l3 agent init: it first
retrieves the HA router count, and then compares the VRRP process
(keepalived) count and the 'neutron-keepalived-state-change' process
count with the hosted router count. If the counts match, the action
that sets the HA port to 'DOWN' state is not triggered anymore.
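A minimal sketch of the comparison (the RPC and process-counting
helper names are stand-ins, not the exact upstream API):

    def ha_ports_need_reset(agent, count_processes):
        # count_processes(name) -> int is a hypothetical helper.
        ha_routers = agent.plugin_rpc.get_ha_router_count(agent.context)
        keepalived = count_processes('keepalived')
        state_change = count_processes('neutron-keepalived-state-change')
        # All counts match: plain agent restart with VRRP still
        # running, so do not set the HA ports to DOWN.
        return not (keepalived == state_change == ha_routers)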
Closes-Bug: #1798475
Change-Id: I5e2bb64df0aaab11a640a798963372c8d91a06a8
There is a race condition between nova-compute booting an instance
and the l3-agent processing the DVR (local) router on the compute
node. The issue can be seen when a large number of instances, under
different DVR routers, are booted on the same host, so the l3-agent
has to process all of these DVR routers on that host concurrently.
For now we have a green pool with 8 greenlets for the router
ResourceProcessingQueue, but some routers can still be left waiting;
even worse, there are time-consuming actions during the router
processing procedure, for instance installing ARP entries, iptables
rules, route rules etc.
So when the VM is up, it tries to fetch metadata via the local proxy
hosted by the DVR router, but the router is not ready yet on that
host, and those instances end up unable to set up some config in the
guest OS.
This patch sizes the L3 router processing queue green pool based on
the router quantity. The pool size is bounded between 8 (the
original value) and 32, because we do not want the L3 agent to
consume too many host resources processing routers on the compute
node.
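A minimal sketch of the sizing rule (constant names are
illustrative):

    ROUTER_PROCESS_GREENLET_MIN = 8   # the original pool size
    ROUTER_PROCESS_GREENLET_MAX = 32  # cap on host resource usage

    def pool_size_for(num_routers):
        # Grow with the number of hosted routers, clamped to [8, 32].
        return max(ROUTER_PROCESS_GREENLET_MIN,
                   min(ROUTER_PROCESS_GREENLET_MAX, num_routers))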
Related-Bug: #1813787
Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
The neutron.common.rpc module has been in neutron-lib for a while
now, and neutron is already shimmed to use neutron-lib.
This patch removes neutron.common.rpc and switches the code over to use
neutron-lib's implementation where needed.
NeutronLibImpact
Change-Id: I733f07a8c4a2af071b3467bd710290eee11a4f4c
In the case of an l3 agent sync, it is important to understand when
a router is processing an update, to identify when it applies
changes that can cause failovers.
Change-Id: Ie9ba2a8ffebfcc3bfb35f7a48f73a25352309b4e
Today the neutron common exceptions already live in neutron-lib and are
shimmed from neutron. This patch removes the neutron.common.exceptions
module and changes neutron's imports over to use their respective
neutron-lib exception module instead.
NeutronLibImpact
Change-Id: I9704f20eb21da85d2cf024d83338b3d94593671e
If the L3 agent fails to send its report_state
to the server, it logs an exception:
Failed reporting state!: MessagingTimeout: Timed out...
If it then tries a second time and succeeds, it just goes on
happily. It would be nice if it logged that it succeeded on the
subsequent attempt, so someone looking at the logs knows it
recovered.
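A minimal sketch of the recovery log (illustrative; modelled on the
agent's report_state loop):

    import logging

    LOG = logging.getLogger(__name__)

    def report_state(agent):
        try:
            agent.state_rpc.report_state(agent.context,
                                         agent.agent_state)
            if agent.failed_report_state:
                agent.failed_report_state = False
                LOG.info("Successfully reported state after a "
                         "previous failure.")
        except Exception:
            agent.failed_report_state = True
            LOG.exception("Failed reporting state!")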
Change-Id: I9019782588caffcb647cf1fd557f76ce89cea254
The L3AgentExtension class delete_router() method expects a
dict as its 'data' argument, but the l3-agent code that
deletes a router was passing just the router ID. Change it to
correctly pass a router dictionary if one exists.
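A minimal sketch of the corrected call site (illustrative; attribute
names follow the agent's RouterInfo usage):

    # Inside the l3-agent's router delete path: hand the extension
    # manager a router dict, not a bare ID.
    ri = self.router_info.get(router_id)
    router = ri.router if ri else {'id': router_id}
    self.l3_ext_manager.delete_router(self.context, router)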
Change-Id: I112d1f8dce9defddfbd8fbfa75bf538e308e1561
Closes-bug: #1809134
In the case where 2 DVR routers are connected to each other via a
tenant network, those routers need to always be deployed on the same
compute nodes.
So this patch changes the DVR router scheduler so that it creates a
DVR router on each host that has VMs or other DVR routers connected
to the same subnets.
Co-Authored-By: Swaminathan Vasudevan <SVasudevan@suse.com>
Closes-Bug: #1786272
Change-Id: I579c2522f8aed2b4388afacba34d9ffdc26708e3
This patch switches callbacks over to the payload object style events
[1] for ROUTER and ROUTER_GATEWAY BEFORE_DELETE based notifications. To
do so a DBEventPayload object is used with the publish() method to pass
along the related data.
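For example, a BEFORE_DELETE notification in the payload style looks
roughly like this (the constants and DBEventPayload come from
neutron-lib; trigger, context and router_id come from the calling
code):

    from neutron_lib.callbacks import events, registry, resources

    registry.publish(resources.ROUTER, events.BEFORE_DELETE, trigger,
                     payload=events.DBEventPayload(
                         context, resource_id=router_id))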
NeutronLibImpact
[1] https://docs.openstack.org/neutron-lib/latest/contributor/callbacks.html#event-payloads
Change-Id: I3ce4475643f4f0afed01f2e9956b3bf84714e6f2
Moved the router processing queue code to the agent/common
directory and renamed it "resource processing queue". This
way it can be consumed by other agents, or possibly even
moved to neutron-lib in the future.
Change-Id: I735cf5b0a915828c420c3316b78a48f6d54035e6
Right now, ha_state can return any value that is in the state file,
or even '' if the file is empty. Instead, return 'unknown' if it is
empty.
We also need to update the translation map in the HA code to handle
this new value and avoid a KeyError.
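A minimal sketch of the property (illustrative; the real code lives
in the HA router class):

    @property
    def ha_state(self):
        try:
            with open(self.ha_state_path) as f:
                # An empty state file maps to 'unknown', not ''.
                return f.read() or 'unknown'
        except (OSError, IOError):
            return 'unknown'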
Related-bug: #1755243
Change-Id: I94a39e574cf4ff5facb76df352c14cbaba793e98
Fix W503 (line break before binary operator) pep8 warnings
and no longer ignore new failures.
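For reference, the pattern being fixed (W503 flags the break before
the operator; breaking after it passes):

    first_term, second_term = 1, 2
    # Flagged by W503: line break before the binary operator.
    value = (first_term
             + second_term)
    # Fixed: break after the operator instead.
    value = (first_term +
             second_term)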
Trivialfix
Change-Id: I7539f3b7187f2ad40681781f74b6e05a01bac474
The l3-agent checks the HA state of routers when a router is
updated. To ensure that the HA state is only checked on HA routers,
the following condition is evaluated:
`if router.get('ha') and not is_dvr_only_agent`. It is intended to
restrict the check to DvrEdgeHaRouter and HaRouter objects.
Unfortunately, there are cases where we have DvrEdgeRouter objects
running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
neutron-lbaas in a landscape with 6 network nodes and
max_l3_agents_per_router set to 3, it may happen that the loadbalancer
is placed on a network node that does not have a DvrEdgeHaRouter running
on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
network node to serve the loadbalancer, just like it would deploy a
DvrEdgeRouter on a compute node when deploying a VM.
Under such circumstances each update to the router will lead to an
AttributeError, because the DvrEdgeRouter object does not have the
ha_state attribute.
This patch circumvents the issue by doing an additional check on the
router object to ensure that it actually has the ha_state attribute.
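A minimal sketch of the guarded condition (illustrative; the exact
expression in the patch may differ):

    # DvrEdgeRouter objects on a dvr_snat agent have no ha_state,
    # so verify the attribute before checking HA state.
    if (router.get('ha') and not is_dvr_only_agent and
            hasattr(ri, 'ha_state')):
        check_ha_state(ri)  # stand-in for the actual HA state check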
Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
Closes-Bug: #1755243
The neutron.common.topics module was rehomed into neutron-lib with
commit Ie88b84949cbd55a4e7ad06341aab77b286cdc485
This patch consumes it by removing the rehomed module from neutron
and using the module from neutron-lib instead.
NeutronLibImpact
Change-Id: Ia4a4604c259ce862597de80c6deeb3d408bf0e95
The L3_AGENT_MODE_DVR_NO_EXTERNAL and DVR_SNAT_BOUND constants were
rehomed into neutron-lib with Ieb9374f5483a0ab2306592ab901686ca374db1c8
This patch consumes them by removing them from neutron and using the
constants from neutron-lib instead.
NeutronLibImpact
Change-Id: Ib63a523721a2fa3d1a978a729de28e6a2e560ef6
Before this change, DVR_SNAT agents would get no routers when
asking for updates due to provisioning of DHCP ports on the
node they are running on. This means that there's no connectivity
between the DHCP port and the network gateway (that may be
hosted on a different node), and therefore things like DNS may
break when a VM attempts resolution when talking to the affected
DHCP port.
This change relaxes a conditional that prevented the right list of
routers from being compiled and returned from the server to the
agent. The agent, on the other hand, needs to make sure to allocate
the right type of router based on what the server returns.
Closes-bug: #1733987
Change-Id: I6124738c3324e0cc3f7998e3a541ff7547f2a8a7
As explained in bug [1], when the l3 agent fails to report state to
the server, its state is set to AGENT_REVIVED, triggering
fetch_and_sync_all_routers, which sets all of its HA network ports
to DOWN, resulting in:
1) the ovs agent rewiring these ports and setting their status to
ACTIVE
2) the server sending router updates to the l3 agent once these
ports are active
As the server, ovs and l3 agents are busy with this processing, the
l3 agent may fail to report state again, repeating the process.
As the l3 agent repeatedly processes the same routers, SIGHUPs are
frequently sent to keepalived, resulting in multiple masters.
To fix this, we call update_all_ha_network_port_statuses at l3 agent
start instead of calling it from fetch_and_sync_all_routers.
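A minimal sketch of the relocation (illustrative; the hook name is a
stand-in for the agent's start-up path):

    def on_agent_start(agent):
        # Called once at l3 agent start, instead of on every full
        # resync triggered by AGENT_REVIVED.
        agent.plugin_rpc.update_all_ha_network_port_statuses(
            agent.context)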
[1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
Related-bug: #1597461
Closes-Bug: #1731595
As soon as we call router_info.initialize(), we could possibly try
to process a router. If it is HA, and we have not fully initialized
the HA port or keepalived manager, we could trigger an exception.
Move the call to check_ha_state_for_router() into the update
notification code so it is done after the router has been created.
Updated the functional tests for this, since the unit tests are now
invalid.
Also added a retry counter to the RouterUpdate object so the
l3-agent code will stop re-enqueuing the same update in an infinite
loop; we delete the router if the limit is reached.
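A minimal sketch of such a counter (illustrative; the real
RouterUpdate carries more fields):

    class RouterUpdate(object):
        """Cap how many times one update can be re-enqueued."""

        def __init__(self, router_id, priority, tries=5):
            self.id = router_id
            self.priority = priority
            self.tries = tries

        def hit_retry_limit(self):
            # Decremented on each re-enqueue; the agent deletes the
            # router once the limit is reached.
            self.tries -= 1
            return self.tries < 0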
Finally, have the L3 HA code verify that the ha_port and
keepalived_manager objects are valid during deletion, since there is
no need to do additional work if they are not.
Change-Id: Iae65305cbc04b7af482032ddf06b6f2162a9c862
Closes-bug: #1726370
Since ri.ex_gw_port can be None, the l3-agent can throw an
exception when looking for ports it might have in a given
network.
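A minimal sketch of the guard (illustrative):

    def ports_in_network(ri, network_id):
        ports = [p for p in ri.internal_ports
                 if p['network_id'] == network_id]
        # ri.ex_gw_port can be None, so check before dereferencing.
        if ri.ex_gw_port and ri.ex_gw_port['network_id'] == network_id:
            ports.append(ri.ex_gw_port)
        return ports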
Change-Id: I3ab3e9c012022cd7eefa5c609ca9540649079ad3
Closes-bug: #1724043