When the L3 agent gets a router update notification, it tries to
retrieve the router info from the neutron server. If the message queue
is down or unreachable at that moment, it gets message-queue related
exceptions and the resync action is scheduled. A RabbitMQ cluster is
sometimes not that easy to recover, and a long MQ recovery time means
the router info sync RPC never succeeds before the maximum retry count
is reached. Then the bad thing happens: the L3 agent tries to remove
the router, which basically shuts down all the existing L3 traffic of
this router.
This patch simply removes that final router removal action and lets
the router keep running as it is.
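A hedged sketch of the behavioural change (the hook name and update attributes below are illustrative stand-ins, not the actual agent code):

    import logging

    LOG = logging.getLogger(__name__)

    def on_resync_retry_limit(update):
        """Hypothetical hook called when a router update exhausts its retries.

        The point of the patch: only log and keep the existing datapath,
        instead of removing the router (which would drop all of its L3
        traffic) as the old code did.
        """
        LOG.warning("Hit retry limit for router %s while the message queue "
                    "was unreachable; leaving the router as it is", update.id)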
Closes-Bug: #1871850
Change-Id: I9062638366b45a7a930f31185cd6e23901a43957
(cherry picked from commit 12b9149e20)
There is a race condition between nova-compute booting an instance and
the l3-agent processing the DVR (local) router on the compute node.
The issue shows up when a large number of instances are booted on the
same host under different DVR routers, so the l3-agent has to process
all of these DVR routers on that host concurrently.
Currently the router ResourceProcessingQueue uses a green pool of 8
greenlets, so some routers may still be left waiting; even worse, the
router processing procedure includes time-consuming actions such as
installing ARP entries, iptables rules, route rules, etc.
So when the VM comes up, it tries to fetch metadata via the local proxy
hosted by the DVR router, but the router is not ready yet on that host,
and as a result those instances cannot apply some of the configuration
in the guest OS.
This patch sizes the L3 router processing queue green pool based on the
router quantity. The pool size is limited to between 8 (the original
value) and 32, because we do not want the L3 agent to consume too many
host resources on router processing on a compute node.
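A hedged sketch of the sizing rule (the helper name and exact formula are illustrative; only the 8-to-32 clamp comes from this description):

    def router_queue_pool_size(num_routers, minimum=8, maximum=32):
        """Grow the processing green pool with the router count, within bounds."""
        return max(minimum, min(num_routers, maximum))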
Conflicts:
neutron/tests/functional/agent/l3/test_legacy_router.py
Related-Bug: #1813787
Change-Id: I62393864a103d666d5d9d379073f5fc23ac7d114
(cherry picked from commit 837c9283ab)
As described in the bug, when an HA router transitions from "master" to
"backup", the "keepalived" processes set the virtual IP in all the other
HA routers. Each HA router then advertises it and "keepalived" decides,
according to a trivial algorithm (higher interface IP), which one should
be "master". At that point, the "keepalived" processes running on the
other servers remove the HA router virtual IP they assigned an instant
before.
To avoid transitioning some routers from "backup" to "master" and then
back to "backup" in a very short period, this patch delays the "backup"
to "master" transition, waiting for a possible new "backup" state. If,
during the waiting period before setting the HA state to "master" (set
to the HA VRRP advert time, 2 seconds by default), the L3 agent receives
a new "backup" HA state, the L3 agent does nothing.
Conflicts:
neutron/agent/l3/agent.py
Closes-Bug: #1837635
Change-Id: I70037da9cdd0f8448e0af8dd96b4e3f5de5728ad
(cherry picked from commit 3f022a193f)
(cherry picked from commit adac5d9b7a)
The fetch_and_sync_all_routers method uses Python's range function,
which accepts only integers.
This patch fixes the division behaviour under Python 3, where the result
is a float, by casting it to int as it was represented in Python 2.
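A minimal illustration of the Python 2/3 difference being fixed (the variable names are made up):

    routers_count = 10
    chunk_size = 3

    # Under Python 3 "/" always yields a float and range(3.33...) raises
    # TypeError; the int() cast restores the truncating Python 2 behaviour.
    for chunk in range(int(routers_count / chunk_size)):
        print("syncing chunk", chunk)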
Change-Id: Ifffdee0d4a3226d4871cfabd0bdbf13d7058a83e
Closes-Bug: #1824334
(cherry picked from commit 49a66dba31)
(cherry picked from commit 7039113990)
The RPC notifier method can sometimes be time-consuming, which causes
other parallel processing resources to fail to send notifications in
time. This patch makes the notification asynchronous.
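A hedged sketch of the switch from a blocking notification to a fire-and-forget green thread (eventlet is already used by the agent; the notifier function here is a stand-in):

    import eventlet

    def notify_server(resource_id):
        """Stand-in for the potentially slow RPC notification."""

    def queue_notification(resource_id):
        # Previously the caller blocked on notify_server(resource_id); now the
        # work is handed off to a green thread and the caller returns at once.
        eventlet.spawn_n(notify_server, resource_id)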
Closes-Bug: #1824911
Change-Id: I3f555a0c78fbc02d8214f12b62c37d140bc71da1
(cherry picked from commit 0f471a47c0)
Moved the router processing queue code to the agent/common
directory and renamed it "resource processing queue". This
way it can be consumed by other agents, or possibly even
moved to neutron-lib in the future.
Conflicts:
neutron/agent/l3/agent.py
neutron/tests/unit/agent/l3/test_agent.py
Change-Id: I735cf5b0a915828c420c3316b78a48f6d54035e6
(cherry picked from commit f24f3b6b7b)
Removing an active or standby HA router from an agent that has a
valid DVR serviceable port (such as DHCP) does not remove the HA
interface associated with the router in the SNAT namespace.
When we try to add the HA router back to the agent, more than one HA
interface gets added to the SNAT namespace, causing further problems,
and we sometimes also see multiple active routers.
This bug might have been introduced by patch [1].
Fix the problem by adding the router namespaces without HA interfaces
when there is no HA, and re-inserting the HA interfaces into the
namespace when the HA router is bound to the agent.
[1] https://review.openstack.org/#/c/522362/
Closes-Bug: #1816698
Change-Id: Ie625abcb73f8185bb2bee06dcd26a01d8af0b0d1
(cherry picked from commit d9e0bab6ac)
When two DVR routers are connected to each other via a tenant network,
those routers need to always be deployed on the same compute nodes.
So this patch changes the DVR router scheduler to create a DVR router
on each host that has VMs or other DVR routers connected to the same
subnets.
Co-Authored-By: Swaminathan Vasudevan <SVasudevan@suse.com>
Closes-Bug: #1786272
Change-Id: I579c2522f8aed2b4388afacba34d9ffdc26708e3
(cherry picked from commit 5018d70241)
This patch switches callbacks over to the payload object style events
[1] for ROUTER and ROUTER_GATEWAY BEFORE_DELETE based notifications. To
do so, a DBEventPayload object is used with the publish() method to pass
along the related data.
NeutronLibImpact
[1] https://docs.openstack.org/neutron-lib/latest/contributor/callbacks.html#event-payloads
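For reference, the payload-style pattern looks roughly like this (the exact call sites in the patch may differ):

    from neutron_lib.callbacks import events, registry, resources

    def notify_router_before_delete(context, router_id):
        registry.publish(resources.ROUTER, events.BEFORE_DELETE, None,
                         payload=events.DBEventPayload(context,
                                                       resource_id=router_id))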
Change-Id: I3ce4475643f4f0afed01f2e9956b3bf84714e6f2
Right now, ha_state could return any value that is in
the state file, or even '' if the file is empty. Instead,
return 'unknown' if it's empty.
We also need to update the translation map in the HA code
to deal with this new value and avoid a KeyError.
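An illustrative, stand-alone version of the intended behaviour (the real logic lives in the HA router/keepalived state handling; here a missing file is also treated as 'unknown'):

    def read_ha_state(state_file):
        """Return the keepalived state, or 'unknown' when nothing is recorded."""
        try:
            with open(state_file) as f:
                state = f.read().strip()
        except OSError:
            return 'unknown'
        return state or 'unknown'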
Related-bug: #1755243
Change-Id: I94a39e574cf4ff5facb76df352c14cbaba793e98
Fix W503 (line break before binary operator) pep8 warnings
and no longer ignore new failures.
Trivialfix
Change-Id: I7539f3b7187f2ad40681781f74b6e05a01bac474
l3-agent checks the HA state of routers when a router is updated.
To ensure that the HA state is only checked on HA routers the following
check is performed: `if router.get('ha') and not is_dvr_only_agent`.
This check should ensure that the check is only performed on
DvrEdgeHaRouter and HaRouter objects.
Unfortunately, there are cases where we have DvrEdgeRouter objects
running on 'dvr_snat' agents. E.g. when deploying a loadbalancer with
neutron-lbaas in a landscape with 6 network nodes and
max_l3_agents_per_router set to 3, it may happen that the loadbalancer
is placed on a network node that does not have a DvrEdgeHaRouter running
on it. In such a case, neutron will deploy a DvrEdgeRouter object on the
network node to serve the loadbalancer, just like it would deploy a
DvrEdgeRouter on a compute node when deploying a VM.
Under such circumstances each update to the router will lead to an
AttributeError, because the DvrEdgeRouter object does not have the
ha_state attribute.
This patch circumvents the issue by doing an additional check on the
router object to ensure that it actually has the ha_state attribute.
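A hedged sketch of the guard (argument names are illustrative; in the agent the 'ha' flag comes from the router dict while ha_state lives on the router_info object):

    def should_check_ha_state(router, ri, is_dvr_only_agent):
        """Only check HA state on routers that actually track one."""
        return bool(router.get('ha')
                    and not is_dvr_only_agent
                    and hasattr(ri, 'ha_state'))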
Change-Id: I755990324db445efd0ee0b8a9db1f4d7bfb58e26
Closes-Bug: #1755243
The neutron.common.topics module was rehomed into neutron-lib with
commit Ie88b84949cbd55a4e7ad06341aab77b286cdc485
This patch consumes it by removing the rehomed module from neutron
and using the module from neutron-lib instead.
NeutronLibImpact
Change-Id: Ia4a4604c259ce862597de80c6deeb3d408bf0e95
The L3_AGENT_MODE_DVR_NO_EXTERNAL and DVR_SNAT_BOUND constants were
rehomed into neutron-lib with Ieb9374f5483a0ab2306592ab901686ca374db1c8
This patch consumes them by removing them from neutron and using the
constants from neutron-lib instead.
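The consumption is a straight import swap, e.g. (assuming the neutron-lib locations):

    from neutron_lib import constants

    agent_mode = constants.L3_AGENT_MODE_DVR_NO_EXTERNAL
    port_host = constants.DVR_SNAT_BOUND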
NeutronLibImpact
Change-Id: Ib63a523721a2fa3d1a978a729de28e6a2e560ef6
Before this change, DVR_SNAT agents would get no routers when
asking for updates due to provisioning of DHCP ports on the
node they are running on. This means that there's no connectivity
between the DHCP port and the network gateway (that may be
hosted on a different node), and therefore things like DNS may
break when a VM attempts resolution when talking to the affected
DHCP port.
This change relaxes a conditional that prevented the right list of
routers from being compiled and returned from the server to the agent.
The agent, on the other hand, needs to make sure it allocates the
right type of router based on what is returned from the server.
Closes-bug: #1733987
Change-Id: I6124738c3324e0cc3f7998e3a541ff7547f2a8a7
As explained in bug [1], when the l3 agent fails to report state to the
server, its state is set to AGENT_REVIVED, triggering
fetch_and_sync_all_routers, which sets all of its HA network ports
to DOWN, resulting in
1) the ovs agent rewiring these ports and setting their status to ACTIVE
2) the server sending router updates to the l3 agent once these ports
   are active
As the server, ovs and l3 agents are busy with this processing, the l3
agent may fail to report state again, repeating the process.
Because the l3 agent repeatedly processes the same routers, SIGHUPs are
frequently sent to keepalived, resulting in multiple masters.
To fix this, update_all_ha_network_port_statuses is now called at l3
agent start instead of from fetch_and_sync_all_routers.
[1] https://bugs.launchpad.net/neutron/+bug/1731595/comments/7
Change-Id: Ia9d5549f7d53b538c9c9f93fe6aa71ffff15524a
Related-bug: #1597461
Closes-Bug: #1731595
As soon as we call router_info.initialize(), we could
possibly try to process a router. If it is HA, and
we have not fully initialized the HA port or keepalived
manager, we could trigger an exception.
Move the call to check_ha_state_for_router() into the
update notification code so it's done after the router
has been created. Updated the functional tests for this
since the unit tests are now invalid.
Also added a retry counter to the RouterUpdate object so
the l3-agent code will stop re-enqueuing the same update
in an infinite loop. We will delete the router if the
limit is reached.
Finally, have the L3 HA code verify that ha_port and
keepalived_manager objects are valid during deletion since
there is no need to do additional work if they are not.
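A minimal sketch of a per-update retry counter like the one described (attribute and method names are illustrative, not necessarily the actual RouterUpdate API):

    class RouterUpdate(object):
        """Queue entry for a pending router update (illustrative)."""

        def __init__(self, router_id, max_attempts=5):
            self.id = router_id
            self.max_attempts = max_attempts
            self.tries = 0

        def start_attempt(self):
            self.tries += 1

        def hit_retry_limit(self):
            """True once the update has been re-enqueued too many times."""
            return self.tries >= self.max_attempts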
Change-Id: Iae65305cbc04b7af482032ddf06b6f2162a9c862
Closes-bug: #1726370
Since ri.ex_gw_port can be None, the l3-agent can throw an
exception when looking for ports it might have in a given
network.
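A minimal sketch of the kind of guard this implies (the function name is made up):

    def gateway_network_id(ri):
        """Return the external network id, or None when no gateway is set."""
        return ri.ex_gw_port['network_id'] if ri.ex_gw_port else None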
Change-Id: I3ab3e9c012022cd7eefa5c609ca9540649079ad3
Closes-bug: #1724043
The neutron-lib commit I360545b6ee4291547e0c5c8e668ad03d3efa4725 moved
the externally consumed globals from neutron.common.constants into lib.
With the exception of PROVISIONAL_IPV6_PD_PREFIX all other constants
in neutron.common.constants should only be used in neutron, and will
hopefully remain that way. External consumers needing access to other
common constants should move them into lib first.
NeutronLibImpact
Change-Id: Ie4bcffccf626a6e1de84af01f3487feb825f8b65
All Newton+ servers using L3RouterPlugin expose the endpoint. Even back
in Newton, when the patch introducing the new RPC entry point landed,
there was no need for this special handling, because neutron-server
is always upgraded before the agents.
TrivialFix
Change-Id: I2afa84d6b5771600068f8e98c407bbdce2f266b0
Since Pike, log messages should not be translated.
This patch removes calls to the i18n _LC, _LI, _LE, _LW markers from
logging logic throughout the code. The Translators definition
from neutron._i18n is removed as well.
This patch also removes the log translation verification from the
ignore directive in tox.ini.
Change-Id: If9aa76fcf121c0e61a7c08088006c5873faee56e
This patch makes the L3 agent update its ports' MTU when it is changed
on the core plugin side.
Related-Bug: #1671634
Change-Id: I4444da6358e8b8420a3a365e1107b02f5bb1161d
DVR supports both East/West and North/South routing. While SNAT is
centralized, DNAT is mostly distributed. There are certain
circumstances where DNAT might be centralized, such as when the ports
are unbound.
In order to have well-defined behavior when no external network
connectivity is available on the compute host, the DNAT functionality
is centralized.
To achieve this we introduce a new agent type option,
'dvr_no_external', to centralize the DNAT.
This new L3 agent type ('dvr_no_external') only allows East/West
routing to occur on the compute host; the DNAT or Floating IP will be
configured on the centralized network node.
Change-Id: Ia5d7336e478e0fa5ba62b7ae5ed0c56656116d94
Partial-Bug: #1667877
In commit 500b255278 we used the "get_router_ids" RPC to update the
HA network port status, but that was only done so the commit could be
backported to other branches.
As the "get_router_ids" RPC is expected to fetch only router ids and
not do any other processing, we add a new RPC,
"update_ha_network_port_status". The L3 agent will call this new RPC
to set the HA network port status to DOWN.
Related-bug: #1597461
Change-Id: I8f34c4f5178d2b422cfcfd082dfc9cf3f89a5d95
Trying to check HA state on a DVR-only compute node
can trigger:
AttributeError: 'DvrLocalRouter' object has no attribute 'ha_state'
Also moved the mode assignment outside of the loops
since it only needs to be done once.
Co-Authored-By: Sean Redmond <sean.redmond1@gmail.com>
Closes-bug: #1691427
Change-Id: I3e48e06e76325939fbc9533b0198924bc96d600e
The callback modules have been available in neutron-lib since commit [1]
and are ready for consumption.
As the callback registry is implemented with a singleton manager
instance, sync complications can arise ensuring all consumers switch to
lib's implementation at the same time. Therefore this consumption has
been broken down:
1) Shim neutron's callbacks using lib's callback system and remove
existing neutron internals related to callbacks (devref, UTs, etc.).
2) Switch all neutron's callback imports over to neutron-lib's.
3) Have all sub-projects using callbacks move their imports over to use
neutron-lib's callbacks implementation.
4) Remove the callback shims in neutron-lib once sub-projects are moved
over to lib's callbacks.
5) Follow-on patches moving our existing uses of callbacks to the new
event payload model provided by neutron-lib.callback.events
This patch implements #2 from above, moving all neutron's callback
imports to use neutron-lib's callbacks.
There are also a few places in the UT code that still patch callbacks,
we can address those in step #4 which may need [2].
NeutronLibImpact
[1] fea8bb64ba7ff52632c2bd3e3298eaedf623ee4f
[2] I9966c90e3f90552b41ed84a68b19f3e540426432
Change-Id: I8dae56f0f5c009bdf3e8ebfa1b360756216ab886
Instead of assigning the reraise flag to False inside the exception
context, it is better to set it to False up front when calling
save_and_reraise_exception().
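The preferred pattern, sketched with oslo_utils (the surrounding function and handler body are illustrative):

    from oslo_utils import excutils

    def process(fail=True):
        try:
            if fail:
                raise RuntimeError("boom")
        except RuntimeError:
            # Preferred: state the intent when entering the context ...
            with excutils.save_and_reraise_exception(reraise=False):
                pass  # log / clean up here
            # ... rather than flipping the flag inside the block:
            # with excutils.save_and_reraise_exception() as ctx:
            #     ctx.reraise = False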
Change-Id: I4c318c92b4ad70c1653b0d26ac171a1216a590f1
Since [1], when the l3 agent does a full sync, it calls
ensure_snat_cleanup for every router depending on whether the agent is
dvr_snat or not. However, DVR+HA routers always keep snat namespaces
on dvr_snat agents for their own keepalived processes. Therefore the
cleanup call is unexpected and causes a series of issues.
This patch ensures that the snat namespaces of DVR+HA routers are not
cleaned up when the agent does a full sync.
[1] https://review.openstack.org/#/c/326729/
Change-Id: I5df0a1404f1a80ab0b226d7a60c2885e24247e02
Closes-Bug: #1632540
When router_info initialize() fails (with a traceback), some resources
(like the keepalived process) may not be created. While handling this
exception, the l3 agent calls _process_updated_router instead of
calling _process_added_router again, which then also fails trying to
access resources that were never created.
With this change, the agent only records the new router_info (i.e.
self.router_info[router_id] = ri) when initialize() succeeds.
When initialize() fails, since the router_info is not part of the
agent, "_process_router_if_compatible" will call initialize() again.
We also clean up the router_info when initialize() fails.
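A hedged sketch of the ordering change (names are stand-ins for the agent internals):

    def add_router(agent, ri, router_id):
        try:
            ri.initialize(agent.process_monitor)
        except Exception:
            ri.delete()  # clean up any partially created resources
            raise
        # Record the router only after initialize() fully succeeded, so a
        # later retry goes through the "added" path again, not "updated".
        agent.router_info[router_id] = ri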
Closes-bug: #1662804
Change-Id: I278ac83de57713c93d6e50846d79034d774c5d47
If all agents are shown as standby, it is possible that a state change
was lost due to problems with RabbitMQ. This change adds a check of
the HA state in fetch_and_sync_all_routers; if the state is different,
the server is notified that the state should be changed.
Also change _get_bindings_and_update_router_state_for_dead_agents
to set standby for a dead agent only when there is more than one
active agent.
Change-Id: If5596eb24041ea9fae1d5d2563dcaf655c5face7
Closes-bug:#1648242
The agent object is a member of some subclasses of RouterInfo, such as
HaRouter. This changeset makes it a member of the RouterInfo class
itself.
Prior to this change, the agent object was passed in to some RouterInfo
methods that need it to access the agent object's member information.
The bugs in question require calling the PD object, which is a member
of the agent object, to get the IPs that need to be preserved in the
gateway port. Without this change, the signatures of the methods
external_gateway_added() and external_gateway_updated() would have to
be modified to pass in the agent object, and any subclass of
RouterInfo that overrides or uses these methods would have to change
as well. That does not seem reasonable considering that subclasses
such as HaRouter already have the agent object as one of their members.
This changeset fixes the bugs by preserving the LLAs for prefix
delegation when the gateway port is being updated.
Closes-Bug: #1639042
Closes-Bug: #1640271
Change-Id: I61c6128ed1973deb8440c54234e77a66987d7e28
Since the refactor is complete, let's clean these up and
use neutron-lib constants instead.
Trivialfix
Change-Id: Ic69d59d53ee78a4c6eb0104583755c4145fb8e46
IPv6 utils is_enabled() doesn't actually determine if IPv6 is enabled on
the host. It checks if /proc/sys/net/ipv6/conf/default/disable_ipv6 is
present and is set to 0. This kernel configuration option controls if
the kernel will automatically assign IPv6 link-local addresses to newly
created network interfaces when their link state changes to up. The
existence of this /proc file does indicate that the Linux kernel has
the ipv6 module loaded or that ipv6 was compiled in. Having this /proc file
set to zero does not indicate IPv6 is not available on the system, just
that newly created interfaces will inherit this configuration and will
not have IPv6 addresses bound to them unless the administrator changes
the interface-specific /proc/sys/net/ipv6/conf/$IFACE/disable_ipv6
configuration.
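A stand-alone sketch of what the existing check actually tests (the helper name is illustrative; the real code lives in neutron's IPv6 utils):

    import os

    _DEFAULT_DISABLE_IPV6 = '/proc/sys/net/ipv6/conf/default/disable_ipv6'

    def ipv6_auto_addressing_enabled_by_default():
        """True when new interfaces will get IPv6 link-local addresses."""
        if not os.path.exists(_DEFAULT_DISABLE_IPV6):
            return False  # ipv6 module not loaded / not compiled in
        with open(_DEFAULT_DISABLE_IPV6) as f:
            return f.read().strip() == '0'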
This check was added to Neutron so it could operate with distributions
which didn't load the ipv6 kernel module, preventing errors when
attempting to make IPv6 specific configurations in the iptables firewall
driver and the L3 agent. Removing it would break existing deployments.
Rename this function to make clear which conditions it actually tests.
In fact, it is a good security practice to set this
default disable_ipv6 option to 1, and explicitly enable IPv6 by setting
disable_ipv6=0 on individual interfaces which the administrator intends
to bind IPv6 addresses on. This establishes parity with IPv4 behavior
where interfaces are not active in an address family until the
administrator explicitly configures them to be active in that address
family. This practice does not currently work as expected with
Neutron, since setting /proc/sys/net/ipv6/conf/default/disable_ipv6 to 1
unexpectedly disables creating IPv6 security group rules leaving
instances completely exposed via IPv6 regardless of security group
rules.
Change-Id: I844b992240a5db642766ec9c04e3b5fcab8e2e23
When everything works as expected, hardly anyone pays attention to this
log trace, which accounts for an incredible amount of log data.
This change emits the router payload only during failures
(when debugging info is needed the most), and furthermore relocates
it to the L3 agent log files, where it is more pertinent.
Partial-bug: #1620864
Change-Id: I64281b963ba52c0a100a6194b7cafc5e9b1a8e74
Generate a new context object request-id for each reference
to self.context. This allows easier tracking of requests
in logs.
This is the L3 agent equivalent fix of
I1d6dc28ba4752d3f9f1020851af2960859aae520.
Related-Bug: #1618231
Closes-Bug: #1619524
Change-Id: I4a49f05ce0e7467084a1c27a64a0d4cf60a5f8cb
In L2 agent extensions, when an agent extension needs access to a
data structure within the L2 agent, an agent extension API object is
created. This API object is the interface permitting agent extensions
to access objects internal to the L2 agent.
This change implements a similar agent extension API object for the L3 agent
extensions. This is necessary to allow L3 agent extensions to have access to
the RouterInfo class, so that they can do lookups on it, for example
determining the namespace for a specific router. Without this API object, the
L3 agent extension would not have access to this structure.
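A hedged sketch of the shape of such an API object (the method name mirrors what an extension would need; the real class may differ):

    class L3AgentExtensionAPI(object):
        """Exposes selected L3 agent internals to loaded extensions."""

        def __init__(self, router_info):
            # router_info: dict mapping router_id -> RouterInfo
            self._router_info = router_info

        def get_router_info(self, router_id):
            """Let an extension look up a RouterInfo, e.g. for its namespace."""
            return self._router_info.get(router_id)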
Co-Authored-By: Margaret Frances <margaret_frances@cable.comcast.com>
Partially-Implements: blueprint l3-agent-extensions
Change-Id: I85f89accbeefd820130335674fd56cb54f1449de
Remove deprecation warnings for various constants
and exceptions that have moved to neutron_lib.
Fix miscellaneous other deprecations.
Use constants instead of l3_constants when importing
neutron-lib constants.
Co-Authored By: Henry Gessau <gessau@gmail.com>
Co-Authored By: Gary Kotton <gkotton@vmware.com>
Change-Id: Ib0e8ff5c3e23677c1009241a1818cbc8a3430c38
Using the generalized agent extension mechanism, create an agent extension
manager in the L3 agent, so that the L3 agent can load agent extensions.
Co-Authored-By: Margaret Frances <margaret_frances@cable.comcast.com>
Implements: blueprint l3-agent-extensions
Needed-By: Iff506bd11b83d396305e631f3dd95d44cf38fd63
Change-Id: I6da92cb8b9fcbb603e120eababcf4ce711da3e30